CS236781: Deep Learning on Computational Accelerators¶

Homework Assignment 4¶

Faculty of Computer Science, Technion.

Submitted by:

# Name Id email
Student 1 [your name here] [your id here] [your email here]
Student 2 [your name here] [your id here] [your email here]

Introduction¶

In this assignment we'll explore deep reinforcement learning. We'll implement two popular and related methods for directly learning the policy of an agent for playing a simple video game. Then we'll focus our attention on image generation and implement two different generative models: A variational autoencoder and a generative adversarial network.

General Guidelines¶

  • Please read the getting started page on the course website. It explains how to set up, run and submit the assignment.
  • This assignment requires running on GPU-enabled hardware. Please read the course servers usage guide. It explains how to use and run your code on the course servers to benefit from training with GPUs.
  • The text and code cells in these notebooks are intended to guide you through the assignment and help you verify your solutions. The notebooks do not need to be edited at all (unless explicitly specified). The only exception is to fill your name(s) in the above cell before submission. Please do not remove sections or change the order of any cells.
  • All your code (and even answers to questions) should be written in the files within the python package corresponding to the assignment number (hw1, hw2, etc). You can of course use any editor or IDE to work on these files.

Contents¶

  • Part 1: Deep Reinforcement Learning
  • Part 2: Variational Autoencoder
  • Part 3: Generative Adversarial Networks
  • Part 4: Summary Questions
$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\cset}[1]{\mathcal{#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} \newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]} \newcommand{\ip}[3]{\left<#1,#2\right>_{#3}} \newcommand{\given}[]{\,\middle\vert\,} \newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)} \newcommand{\grad}[]{\nabla} $$

Part 1: Deep Reinforcement Learning¶

In the tutorial we have seen value-based reinforcement learning, in which we learn to approximate the action-value function $q(s,a)$.

In this exercise we'll explore a different approach, directly learning the agent's policy distribution, $\pi(a|s)$ by using policy gradients, in order to safely land on the moon!

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
In [2]:
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Prefer CPU, GPU won't help much in this assignment
device = 'cpu'
print('Using device:', device)

# Seed for deterministic tests
SEED = 42
Using device: cpu

Some technical notes before we begin:

  • This part does not require a GPU. We won't need large models, and the computation bottleneck will be the generation of episodes to train on.
  • In order to run this notebook on the server, you must prepend the xvfb-run command to create a virtual screen. For example,
    • To run this notebook with srun, do
        srun -c2 --gres=gpu:1 xvfb-run -a -s "-screen 0 1440x900x24" python main.py run-nb <filename>
    • To run the submission script, do
        srun -c2 xvfb-run -a -s "-screen 0 1440x900x24" python main.py prepare-submission ...
    • Note that we have already included the xvfb-run command inside the jupyter-lab.sh script, so you can use it as usual with srun.
  • The OpenAI gym library is not officially supported on Windows. It should still be possible to install and run the necessary environment for this exercise, but we cannot provide technical support for this. If you have trouble installing locally, we suggest running on the course server.
  • When running the gym environment locally (i.e. not on the course server), an interactive window should appear, showing you the gameplay. There's currently a known issue when running this through jupyter: the window may remain open and seem stuck after the episode completes. If this happens, it is OK: you can keep running the notebook and the rest of the cells won't be affected. The window will close properly when you shut down the kernel.

Policy gradients¶

Recall from the tutorial that we define the policy of an agent as the conditional distribution, $$ \pi(a|s) = \Pr(a_t=a\vert s_t=s), $$ which defines how likely the agent is to take action $a$ at state $s$.

Furthermore we define the action-value function, $$ q_{\pi}(s,a) = \E{g_t(\tau)|s_t = s,a_t=a,\pi} $$ where $$ g_t(\tau) = r_{t+1}+\gamma r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+1+k}, $$ is the total discounted reward of a specific trajectory $\tau$ from time $t$, and the expectation in $q$ is over all possible trajectories, $ \tau=\left\{ (s_0,a_0,r_1,s_1), \dots (s_T,a_T,r_{T+1},s_{T+1}) \right\}. $

In the tutorial we saw that we can learn a value function starting with some random function and updating it iteratively by using the Bellman optimality equation. Given that we have some action-value function, we can immediately create a policy based on that by simply selecting an action which maximizes the action-value at the current state, i.e. $$ \pi(a|s) = \begin{cases} 1, & a = \arg\max_{a'\in\cset{A}} q(s,a') \\ 0, & \text{else} \end{cases}. $$ This is called $q$-learning. This approach aims to obtain a policy indirectly through the action-value function. Yet, in most cases we don't actually care about knowing the value of particular states, since all we need is a good policy for our agent.
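
As a small illustration of the greedy rule above, the following sketch turns a list of action-values for one state into the corresponding one-hot policy. The function name `greedy_policy` and the numbers are hypothetical and not part of the assignment code.

```python
def greedy_policy(q_values):
    """Return pi(a|s) as a list: 1 for the arg-max action, 0 for all others."""
    best_action = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best_action else 0.0 for a in range(len(q_values))]


# Hypothetical action-values q(s, a) for a single state with four actions.
q_s = [0.1, -0.4, 2.3, 0.7]
pi_s = greedy_policy(q_s)
print(pi_s)  # [0.0, 0.0, 1.0, 0.0]
```

Note that such a policy is deterministic, whereas the policies learned in this part are full distributions over actions.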

Here we'll take a different approach and learn a policy distribution $\pi(a|s)$ directly - by using policy gradients.

Formalism¶

We define a parametric policy, $\pi_\vec{\theta}(a|s)$, and maximize total discounted reward (or minimize the negative reward): $$ \mathcal{L}(\vec{\theta})=\E[\tau]{-g(\tau)|\pi_\vec{\theta}} = -\int g(\tau)p(\tau|\vec{\theta})d\tau, $$ where $p(\tau|\vec{\theta})$ is the probability of a specific trajectory $\tau$ under the policy defined by $\vec{\theta}$.

Since we want to find the parameters $\vec{\theta}$ which minimize $\mathcal{L}(\vec{\theta})$, we'll compute the gradient w.r.t. $\vec{\theta}$: $$ \grad\mathcal{L}(\vec{\theta}) = -\int g(\tau)\grad p(\tau|\vec{\theta})d\tau. $$

Unfortunately, if we try to write $p(\tau|\vec{\theta})$ explicitly, we find that computing its gradient with respect to $\vec{\theta}$ is quite intractable due to a huge product of terms depending on $\vec{\theta}$: $$ p(\tau|\vec{\theta})=p\left(\left\{ (s_t,a_t,r_{t+1},s_{t+1})\right\}_{t\geq0}\given\vec{\theta}\right) =p(s_0)\prod_{t\geq0} \pi_{\vec{\theta}}(a_t|s_t)p(s_{t+1}|s_t,a_t). $$

However, by using the fact that $\grad_{x}\log(f(x))=\frac{\grad_{x}f(x)}{f(x)}$, we can convert the product into a sum: $$ \begin{align} \grad\mathcal{L}(\vec{\theta}) &= -\int g(\tau)\grad p(\tau|\vec{\theta})d\tau = -\int g(\tau)\frac{\grad p(\tau|\vec{\theta})}{p(\tau|\vec{\theta})}p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\grad\log\left(p(\tau|\vec{\theta})\right)p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\grad\log\left( p(s_0)\prod_{t\geq0} \pi_{\vec{\theta}}(a_t|s_t)p(s_{t+1}|s_t,a_t) \right) p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\grad\left( \log p(s_0) + \sum_{t\geq0} \log \pi_{\vec{\theta}}(a_t|s_t) + \sum_{t\geq0}\log p(s_{t+1}|s_t,a_t) \right) p(\tau|\vec{\theta})d\tau \\ &= -\int g(\tau)\sum_{t\geq0} \grad\log \pi_{\vec{\theta}}(a_t|s_t) p(\tau|\vec{\theta})d\tau \\ &= \E[\tau]{-g(\tau)\sum_{t\geq0} \grad\log \pi_{\vec{\theta}}(a_t|s_t)}. \end{align} $$

This is the "vanilla" version of the policy gradient. We can interpret it as a weighted log-likelihood function. The log-policy is the log-likelihood term we wish to maximize and the total discounted reward acts as a weight: high-return positive trajectories will cause the probability of actions taken during them to increase, and negative-return trajectories will cause the probabilities of actions taken to decrease.

In the following figures we see three trajectories: high-return positive-reward (green), low-return positive-reward (yellow) and negative-return (red) and the action probabilities along the trajectories after the update. Credit: Sergey Levine.

The major drawback of the policy-gradient is its high variance, which causes erratic optimization behavior and therefore slow convergence. One reason for this is that the log-policy weight term, $g(\tau)$, can vary wildly between different trajectories, even if they're similar in actions. Later on we'll implement the loss and explore some methods of variance reduction.

Landing on the moon with policy gradients¶

In the spirit of the recent achievements of the Israeli space industry, we'll apply our reinforcement learning skills to solve a simple game called LunarLander.

This game is available as an environment in OpenAI gym.

In this environment, you need to control the lander and get it to land safely on the moon. To do so, you must apply bottom, right or left thrusters (each is either fully on or fully off) and get it to land within the designated zone as quickly as possible and with minimal wasted fuel.

In [3]:
import gym

# Just for fun :) ... but also to re-define the default max number of steps
ENV_NAME = 'Beresheet-v2'
MAX_EPISODE_STEPS = 300
if ENV_NAME not in gym.envs.registry.env_specs:
    gym.register(
        id=ENV_NAME,
        entry_point='gym.envs.box2d:LunarLander',
        max_episode_steps=MAX_EPISODE_STEPS,
        reward_threshold=200,
    )
In [4]:
import gym

env = gym.make(ENV_NAME)

print(env)
print(f'observations space: {env.observation_space}')
print(f'action space: {env.action_space}')

ENV_N_ACTIONS = env.action_space.n
ENV_N_OBSERVATIONS = env.observation_space.shape[0]
<TimeLimit<LunarLander<Beresheet-v2>>>
observations space: Box([-inf -inf -inf -inf -inf -inf -inf -inf], [inf inf inf inf inf inf inf inf], (8,), float32)
action space: Discrete(4)

The observation at each step consists of the lander's position, velocity, angle, angular velocity and ground contact state. The actions are no-op, fire left thruster, fire bottom thruster and fire right thruster.

You are highly encouraged to read the documentation in the source code of the LunarLander environment to understand the reward system, and see how the actions and observations are created.

Policy network and Agent¶

Let's start with our policy-model. This will be a simple neural net, which should take an observation and return a score for each possible action.

TODO:

  1. Implement all methods in the PolicyNet class in the hw4/rl_pg.py module. Start small. A simple MLP with a few hidden layers is a good starting point. You can come back and change it later based on the experiments.
    Notice that we'll use the build_for_env method to instantiate a PolicyNet based on the configuration of a given environment.
  2. If you need hyperparameters to configure your model (e.g. number of hidden layers, sizes, etc.), add them in part1_pg_hyperparams() in hw4/answers.py.
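
As a rough sketch of the kind of MLP described above (not the assignment's PolicyNet), the class below maps an observation vector to one raw score per action. The class name `TinyPolicyNet` and the `hidden_dims` parameter are hypothetical choices made for this illustration.

```python
import torch
import torch.nn as nn


class TinyPolicyNet(nn.Module):
    """Hypothetical stand-in for PolicyNet: observation in, one score per action out."""

    def __init__(self, in_features, out_actions, hidden_dims=(64, 64)):
        super().__init__()
        layers = []
        prev = in_features
        for h in hidden_dims:
            layers += [nn.Linear(prev, h), nn.ReLU()]
            prev = h
        # Final layer outputs raw (unnormalized) action scores, no softmax here.
        layers.append(nn.Linear(prev, out_actions))
        self.fc = nn.Sequential(*layers)

    def forward(self, x):
        return self.fc(x)


net = TinyPolicyNet(in_features=8, out_actions=4)
scores = net(torch.zeros(5, 8))  # batch of 5 dummy observations
print(scores.shape)  # torch.Size([5, 4])
```

Leaving the output unnormalized lets the caller decide whether to apply a softmax (to act) or a log-softmax (to compute a loss).
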
In [5]:
print(env.observation_space.sample())
print(env.unwrapped.action_space)
[ 2.5857196  -0.9514567   0.05757594 -0.4913913   1.2619035  -0.69068193
  0.80056274  1.9852805 ]
Discrete(4)
In [6]:
import hw4.rl_pg as hw4pg
import hw4.answers

hp = hw4.answers.part1_pg_hyperparams()

# You can add keyword-args to this function which will be populated from the
# hyperparameters dict.
p_net = hw4pg.PolicyNet.build_for_env(env, device, **hp)
p_net
Out[6]:
PolicyNet(
  (fc): Sequential(
    (0): Linear(in_features=8, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=4, bias=True)
  )
)

Now we need an agent. The purpose of our agent will be to act according to the current policy and generate experiences. Our PolicyAgent will use a PolicyNet as the current policy function.

We'll also define some extra datatypes to help us represent the data generated by our agent. You can find the Experience, Episode and TrainBatch datatypes in the hw4/rl_data.py module.

TODO: Implement the current_action_distribution() method of the PolicyAgent class in the hw4/rl_pg.py module.
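
To make the idea concrete, here is a minimal pure-Python sketch of turning raw action scores into a probability distribution and sampling an action from it. The helper names `softmax` and `sample_action` and the score values are hypothetical; in PyTorch, `torch.softmax` and `torch.multinomial` can play the same roles.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of raw action scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw one action index according to the probabilities in `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)            # fixed seed, for reproducibility only
scores = [0.2, -1.0, 0.5, 0.1]    # hypothetical network outputs for one state
probs = softmax(scores)
action = sample_action(probs, rng)
print(probs, action)
```

Sampling, rather than always taking the most likely action, is what lets the agent explore.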

In [7]:
for i in range(10):
    agent = hw4pg.PolicyAgent(env, p_net, device)
    d = agent.current_action_distribution()
    
    test.assertSequenceEqual(d.shape, (env.action_space.n,))
    test.assertAlmostEqual(d.sum(), 1.0, delta=1e-5)
    
print(d)
tensor([0.2590, 0.2205, 0.2669, 0.2537])

TODO: Implement the step() method of the PolicyAgent.

In [8]:
agent = hw4pg.PolicyAgent(env, p_net, device)
exp = agent.step()

test.assertIsInstance(exp, hw4pg.Experience)
print(exp)
Experience(state=tensor([-0.0035,  1.4218, -0.3593,  0.4822,  0.0041,  0.0814,  0.0000,  0.0000]), action=3, reward=1.3705129569685834, is_done=False)

To test our agent, we'll write some code that allows it to play an environment. We'll use the Monitor wrapper in gym to generate a video of the episode for visual debugging.

TODO: Complete the implementation of the monitor_episode() method of the PolicyAgent.

In [9]:
env, n_steps, reward = agent.monitor_episode(ENV_NAME, p_net, device=device)

To display the Monitor video in this notebook, we'll use a helper function from our jupyter_utils and a small wrapper that extracts the path of the last video file.

In [10]:
import cs236781.jupyter_utils as jupyter_utils

def show_monitor_video(monitor_env, idx=0, **kw):
    # Extract video path
    video_path = monitor_env.videos[idx][0]
    video_path = os.path.relpath(video_path, start=os.path.curdir)
    
    # Use helper function to embed the video
    return jupyter_utils.show_video_in_notebook(video_path, **kw)
In [11]:
print(f'Episode ran for {n_steps} steps. Total reward: {reward:.2f}')

show_monitor_video(env, idx=0)
Episode ran for 61 steps. Total reward: -85.92
Out[11]:

Training data¶

The next step is to create data to train on. We need to train on batches of state-action pairs, so that our network can learn to predict the actions.

We'll split this task into three parts:

  1. Generate a batch of Episodes, by using an Agent that's playing according to our current policy network. Each Episode object contains the Experience objects created by the agent.
  2. Calculate the total discounted reward for each state we encountered and action we took. This is our action-value estimate.
  3. Convert the Episodes into a batch of tensors to train on. Each batch will contain states, action taken per state, reward accrued, and the calculated estimated action-values. These will be stored in a TrainBatch object.

TODO: Complete the implementation of the episode_batch_generator() method in the TrainBatchDataset class within the hw4.rl_data module. This will address part 1 in the list above.

In [12]:
import hw4.rl_data as hw4data

def agent_fn():
    env = gym.make(ENV_NAME)
    hp = hw4.answers.part1_pg_hyperparams()
    p_net = hw4pg.PolicyNet.build_for_env(env, device, **hp)
    return hw4pg.PolicyAgent(env, p_net, device)
    
ds = hw4data.TrainBatchDataset(agent_fn, episode_batch_size=8, gamma=0.9)
batch_gen = ds.episode_batch_generator()
b = next(batch_gen)
print('First episode:', b[0])

test.assertEqual(len(b), 8)
for ep in b:
    test.assertIsInstance(ep, hw4data.Episode)
    
    # Check that it's a full episode
    is_done = [exp.is_done for exp in ep.experiences]
    test.assertFalse(any(is_done[0:-1]))
    test.assertTrue(is_done[-1])
First episode: Episode(total_reward=-174.26, #experences=65)

TODO: Complete the implementation of the calc_qvals() method in the Episode class. This will address part 2. These q-values are an estimate of the actual action value function: $$\hat{q}_{t} = \sum_{t'\geq t} \gamma^{t'-t}r_{t'+1}.$$
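
The sum above can be computed in a single backward pass over the rewards, since $\hat{q}_{t} = r_{t+1} + \gamma\hat{q}_{t+1}$. The following standalone sketch illustrates this recursion; the function name `discounted_rewards_to_go` is hypothetical and it works on a plain list of rewards rather than on an Episode.

```python
def discounted_rewards_to_go(rewards, gamma):
    """Compute q_hat[t] = sum_{t' >= t} gamma**(t' - t) * rewards[t'].

    `rewards[t]` plays the role of r_{t+1}, the reward received after step t.
    """
    qvals = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the value of the step after it:
    # q_hat[t] = rewards[t] + gamma * q_hat[t + 1].
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        qvals[t] = running
    return qvals


print(discounted_rewards_to_go([1.0, 2.0, 3.0], gamma=0.5))  # [2.75, 3.5, 3.0]
```

The backward pass costs time linear in the episode length, whereas evaluating the sum separately for every $t$ would be quadratic.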

In [13]:
np.random.seed(SEED)
test_rewards = np.random.randint(-10, 10, 100)
test_experiences = [hw4pg.Experience(None,None,r,False) for r in test_rewards] 
test_episode = hw4data.Episode(np.sum(test_rewards), test_experiences)

qvals = test_episode.calc_qvals(0.9)
qvals = list(qvals)

expected_qvals = np.load(os.path.join('tests', 'assets', 'part1_expected_qvals.npy'))
for i in range(len(test_rewards)):
    test.assertAlmostEqual(expected_qvals[i], qvals[i], delta=1e-3)

TODO: Complete the implementation of the from_episodes() method in the TrainBatch class. This will address part 3.

Notes:

  • The TrainBatchDataset class provides a generator function that will use the above function to lazily generate batches of training samples and labels on demand.
  • This allows us to use a standard PyTorch dataloader to wrap our Dataset and provide us with parallel data loading for free! This means we can run multiple environments with multiple agents in separate background processes to generate data for training and thus prevent the data loading bottleneck which is caused by the fact that we must generate full Episodes to train on in order to calculate the q-values.
  • We'll set the DataLoader's batch_size to None because we have already implemented custom batching in our dataset.
  • You can choose the number of worker processes generating data using the num_workers parameter in the hyperparams dict. Set num_workers=0 to disable parallelization.
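
The notes above can be illustrated with a toy dataset that yields ready-made batches of varying size. This is only a sketch: `ToyBatchDataset` is a hypothetical class, and it assumes (as seems to be the case for TrainBatchDataset) that the dataset is an iterable-style one.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset


class ToyBatchDataset(IterableDataset):
    """Hypothetical dataset that yields ready-made batches of varying length."""

    def __init__(self, batch_lengths):
        super().__init__()
        self.batch_lengths = batch_lengths

    def __iter__(self):
        # Each yielded item is already a full batch; its length may differ
        # from one batch to the next, like batches built from whole episodes.
        for n in self.batch_lengths:
            yield torch.zeros(n, 8)


# batch_size=None disables automatic batching, so items pass through as-is.
dl = DataLoader(ToyBatchDataset([5, 3, 7]), batch_size=None, num_workers=0)
shapes = [tuple(batch.shape) for batch in dl]
print(shapes)  # [(5, 8), (3, 8), (7, 8)]
```
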
In [14]:
from torch.utils.data import DataLoader

hp = hw4.answers.part1_pg_hyperparams()

ds = hw4data.TrainBatchDataset(agent_fn, episode_batch_size=8, gamma=0.9)
dl = DataLoader(
    ds,
    batch_size=None,
    num_workers=hp['num_workers'],
    multiprocessing_context='fork' if hp['num_workers'] > 0 else None
)


for i, train_batch in enumerate(dl):
    states, actions, qvals, reward_mean = train_batch
    print(f'#{i}: {train_batch}', end="\n\n")
    test.assertEqual(states.shape[0], actions.shape[0])
    test.assertEqual(qvals.shape[0], actions.shape[0])
    test.assertEqual(states.shape[1], env.observation_space.shape[0])
    if i > 1:
        break
#0: TrainBatch(states: torch.Size([754, 8]), actions: torch.Size([754]), q_vals: torch.Size([754])), num_episodes: 8)

#1: TrainBatch(states: torch.Size([768, 8]), actions: torch.Size([768]), q_vals: torch.Size([768])), num_episodes: 8)

#2: TrainBatch(states: torch.Size([732, 8]), actions: torch.Size([732]), q_vals: torch.Size([732])), num_episodes: 8)

Loss functions¶

As usual, we need a loss function to optimize over. We'll calculate three types of losses:

  1. The causal vanilla policy gradient loss.
  2. The policy gradient loss, with a baseline to reduce variance.
  3. An entropy-based loss whose purpose is to diversify the agent's action selection, and prevent it from being "too sure" about its actions. This loss will be used together with one of the above losses.

Causal vanilla policy-gradient¶

We have derived the policy-gradient as $$ \grad\mathcal{L}(\vec{\theta}) = \E[\tau]{-g(\tau)\sum_{t\geq0} \grad\log \pi_{\vec{\theta}}(a_t|s_t)}. $$

By writing the discounted reward explicitly and enforcing causality, i.e. the action taken at time $t$ can't affect the reward at time $t'<t$, we can get a slightly lower-variance version of the policy gradient:

$$ \grad\mathcal{L}_{\text{PG}}(\vec{\theta}) = \E[\tau]{-\sum_{t\geq0} \left(\sum_{t'\geq t} \gamma^{t'-t}r_{t'+1} \right)\grad\log \pi_{\vec{\theta}}(a_t|s_t)}. $$

In practice, the expectation over trajectories is calculated using a Monte-Carlo approach, i.e. simply sampling $N$ trajectories and averaging the term inside the expectation. Therefore, we will use the following estimated version of the policy gradient:

$$ \begin{align} \hat\grad\mathcal{L}_{\text{PG}}(\vec{\theta}) &=-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\sum_{t'\geq t} \gamma^{t'-t}r_{i,t'+1} \right)\grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}) \\ &=-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \hat{q}_{i,t} \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). \end{align} $$

Note the use of the notation $\hat{q}_{i,t}$ to represent the estimated action-value at time $t$ in the sampled trajectory $i$. Here $\hat{q}_{i,t}$ is acting as the weight-term for the policy gradient.
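
In an autograd framework we usually don't compute this gradient by hand. Instead we build a scalar "surrogate" loss whose gradient equals the estimate above. The sketch below shows one way to do this in PyTorch; the function name and argument layout are hypothetical, and the normalization the assignment's tests expect may differ from the $1/N$ used here.

```python
import torch


def policy_gradient_loss(action_scores, actions, q_vals, num_episodes):
    """Surrogate loss whose gradient matches the estimated policy gradient.

    action_scores: (T, A) raw scores for all T steps in the batch.
    actions:       (T,)   index of the action taken at each step.
    q_vals:        (T,)   estimated action-values, treated as constants.
    num_episodes:  N, the number of sampled trajectories in the batch.
    """
    log_probs = torch.log_softmax(action_scores, dim=1)
    # Pick log pi(a_t | s_t) for the action that was actually taken.
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(q_vals * taken).sum() / num_episodes


scores = torch.zeros(3, 4, requires_grad=True)  # uniform policy over 4 actions
actions = torch.tensor([0, 2, 1])
q_vals = torch.tensor([1.0, 2.0, 3.0])
loss = policy_gradient_loss(scores, actions, q_vals, num_episodes=1)
loss.backward()
print(loss.item())  # 6 * log(4), roughly 8.318
```

Because `q_vals` carries no gradient, differentiating this loss only propagates through the log-policy term, which is exactly the structure of the estimator.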

TODO: Complete the implementation of the VanillaPolicyGradientLoss class in the hw4/rl_pg.py module.

In [15]:
# Ensure deterministic run
env = gym.make(ENV_NAME)
env.seed(SEED)
torch.manual_seed(SEED)

def agent_fn():
    # Use a simple "network" here, so that this test doesn't depend on
    # your specific PolicyNet implementation
    p_net_test = nn.Linear(ENV_N_OBSERVATIONS, ENV_N_ACTIONS, bias=True)
    agent = hw4pg.PolicyAgent(env, p_net_test)
    return agent

dataloader = hw4data.TrainBatchDataset(agent_fn, gamma=0.9, episode_batch_size=4)

test_batch = next(iter(dataloader))
test_action_scores = torch.randn(len(test_batch), env.action_space.n)
print(f"{test_batch=}", end='\n\n')
print(f"test_action_scores=\n{test_action_scores}\nshape={test_action_scores.shape}", end='\n\n')

loss_fn_p = hw4pg.VanillaPolicyGradientLoss()
loss_p, _ = loss_fn_p(test_batch, test_action_scores)

print(f'{loss_p=}')
test.assertAlmostEqual(loss_p.item(), -48.560, delta=1e-2)
test_batch=TrainBatch(states: torch.Size([375, 8]), actions: torch.Size([375]), q_vals: torch.Size([375])), num_episodes: 4)

test_action_scores=
tensor([[ 0.8932,  0.4749,  0.8569, -0.7365],
        [-0.7853,  1.0901, -0.0665,  1.2573],
        [ 0.0867, -1.2705, -0.1987, -0.4103],
        ...,
        [-0.7778, -2.4352,  0.1117,  0.9482],
        [-1.4593, -0.0609, -0.1148,  1.5804],
        [ 1.2975, -0.3326, -1.0626,  0.3869]])
shape=torch.Size([375, 4])

loss_p=tensor(-48.5605, dtype=torch.float64)

Policy-gradient with baseline¶

Another way to reduce the variance of our gradient is to use relative weighting of the log-policy instead of absolute reward values. $$ \hat\grad\mathcal{L}_{\text{BPG}}(\vec{\theta}) =-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\hat{q}_{i,t}-b\right) \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). $$ In other words, we don't measure a trajectory's worth by its total reward, but by how much better that total reward is relative to some expected ("baseline") reward value, denoted above by $b$. Note that subtracting a baseline has no effect on the expected value of the policy gradient. It's easy to prove this directly by definition.
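
To see why subtracting a baseline leaves the expected gradient unchanged, here is a sketch for a single time step and a discrete action set, assuming that $b$ does not depend on the sampled action $a$: $$ \begin{align} \E[a\sim\pi_{\vec{\theta}}(\cdot|s)]{b\,\grad\log\pi_{\vec{\theta}}(a|s)} &= b\sum_{a}\pi_{\vec{\theta}}(a|s)\frac{\grad\pi_{\vec{\theta}}(a|s)}{\pi_{\vec{\theta}}(a|s)} \\ &= b\,\grad\sum_{a}\pi_{\vec{\theta}}(a|s) = b\,\grad 1 = 0. \end{align} $$ The extra term contributed by the baseline therefore vanishes in expectation, while it can still change the variance of the estimate.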

Here we'll implement a very simple baseline (not optimal in terms of variance reduction): the average of the estimated action-values $\hat{q}_{i,t}$.

TODO: Complete the implementation of the BaselinePolicyGradientLoss class in the hw4/rl_pg.py module.

In [16]:
# Using the same batch and action_scores from above cell
loss_fn_p = hw4pg.BaselinePolicyGradientLoss()
loss_p, loss_dict = loss_fn_p(test_batch, test_action_scores)

print(f'{loss_dict=}')
test.assertAlmostEqual(loss_dict['baseline'], -29.841, delta=1e-2)
test.assertAlmostEqual(loss_p.item(), 1.297, delta=1e-2)
loss_dict={'loss_p': 1.2976918766803833, 'baseline': -29.841257246972788}

Entropy loss¶

The entropy of a probability distribution (in our case the policy), is $$ H(\pi) = -\sum_{a} \pi(a|s)\log\pi(a|s). $$ The entropy is always non-negative and obtains its maximum for a uniform distribution. We'll use the entropy of the policy as a bonus, i.e. we'll try to maximize it. The idea is to prevent the policy distribution from becoming too narrow and thus promote the agent's exploration.

First, we'll calculate the maximal possible entropy value of the action distribution for a set number of possible actions. This will be used as a normalization term.

TODO: Complete the implementation of the calc_max_entropy() method in the ActionEntropyLoss class.
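
For a uniform distribution over $n$ actions the entropy equals $\log n$, which is the maximal value. The following pure-Python sketch checks this numerically; the function names are hypothetical and independent of the ActionEntropyLoss class.

```python
import math


def entropy(probs):
    """Entropy H = -sum_a p(a) * log p(a), with the convention 0 * log 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def max_entropy(n_actions):
    """Entropy of the uniform distribution over n_actions actions: log(n_actions)."""
    return math.log(n_actions)


uniform = [0.25] * 4
peaked = [0.97, 0.01, 0.01, 0.01]
print(entropy(uniform), max_entropy(4))   # both are log(4)
print(entropy(peaked) / max_entropy(4))   # normalized entropy, between 0 and 1
```

Dividing by the maximal entropy gives a value in $[0,1]$ regardless of the number of actions, which is why it is convenient as a normalization term.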

In [17]:
loss_fn_e = hw4pg.ActionEntropyLoss(env.action_space.n)
print('max_entropy = ', loss_fn_e.max_entropy)

test.assertAlmostEqual(loss_fn_e.max_entropy, 1.38629436, delta=1e-3)
max_entropy =  1.3862943611198906

TODO: Complete the implementation of the forward() method in the ActionEntropyLoss class.

In [18]:
loss_e, _ = loss_fn_e(test_batch, test_action_scores)
print('loss = ', loss_e)

test.assertAlmostEqual(loss_e.item(), -0.8103, delta=1e-2)
loss =  tensor(-0.8106)

Training¶

We'll implement our training procedure as follows:

  1. Initialize the current policy to be a random policy.
  2. Sample $N$ trajectories from the environment using the current policy.
  3. Calculate the estimated $q$-values, $\hat{q}_{i,t} = \sum_{t'\geq t} \gamma^{t'-t}r_{i,t'+1}$ for each trajectory $i$.
  4. Calculate policy gradient estimate $\hat\grad\mathcal{L}(\vec{\theta})$ as defined above.
  5. Perform SGD update $\vec{\theta}\leftarrow\vec{\theta}-\eta\hat\grad\mathcal{L}(\vec{\theta})$.
  6. Repeat from step 2.

This is known as the REINFORCE algorithm.
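
The steps above can be sketched end to end on a toy problem that needs no environment or neural network. The example below is a hypothetical one-step, two-action task with a softmax policy over two parameters, for which the gradient of the log-policy can be written by hand. It is an illustration of the procedure, not the assignment's training code.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(n_iters=200, n_samples=16, lr=0.5, seed=0):
    """REINFORCE on a hypothetical one-step, two-action problem.

    Action 1 always yields reward 1 and action 0 yields reward 0, so each
    "trajectory" is a single step and q_hat is simply the reward.
    """
    rng = random.Random(seed)
    rewards = [0.0, 1.0]
    theta = [0.0, 0.0]  # step 1: start from a uniform (random) policy

    for _ in range(n_iters):
        probs = softmax(theta)
        grad = [0.0, 0.0]
        for _ in range(n_samples):
            # Step 2: sample a trajectory (a single action) from the policy.
            a = rng.choices([0, 1], weights=probs, k=1)[0]
            q_hat = rewards[a]  # step 3: estimated q-value
            # Step 4: for a softmax policy, d log pi(a) / d theta_k equals
            # 1[k == a] - pi(k); the loss gradient is minus q_hat times that.
            for k in range(2):
                indicator = 1.0 if k == a else 0.0
                grad[k] += -q_hat * (indicator - probs[k]) / n_samples
        # Step 5: gradient-descent update on the loss.
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return softmax(theta)


final_probs = reinforce_bandit()
print(final_probs)  # probability mass should concentrate on action 1
```

In the real task the hand-written gradient is replaced by autograd through the policy network, and each trajectory spans many steps.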

Fortunately, we've already implemented everything we need for steps 1-4 so we need only a bit more code to put it all together.

The following block implements a wrapper, train_pg, which creates all the objects we need in order to train our policy gradient model.

In [19]:
import hw4.answers
from functools import partial

ENV_NAME = "Beresheet-v2"

def agent_fn_train(agent_type, p_net, seed, envs_dict):
    winfo = torch.utils.data.get_worker_info()
    wid = winfo.id if winfo else 0
    seed = seed + wid if seed else wid

    env = gym.make(ENV_NAME)
    envs_dict[wid] = env
    env.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

    return agent_type(env, p_net)

def train_rl(agent_type, net_type, loss_fns, hp, seed=None, checkpoints_file=None, **train_kw):
    print(f'hyperparams: {hp}')
    
    envs = {}
    p_net = net_type(ENV_N_OBSERVATIONS, ENV_N_ACTIONS, **hp)
    p_net.share_memory()
    agent_fn = partial(agent_fn_train, agent_type, p_net, seed, envs)
    
    dataset = hw4data.TrainBatchDataset(agent_fn, hp['batch_size'], hp['gamma'])
    dataloader = DataLoader(
        dataset, batch_size=None,
        num_workers=hp['num_workers'],
        multiprocessing_context='fork' if hp['num_workers'] > 0 else None
    )
    optimizer = optim.Adam(p_net.parameters(), lr=hp['learn_rate'], eps=hp['eps'])
    
    trainer = hw4pg.PolicyTrainer(p_net, optimizer, loss_fns, dataloader, checkpoints_file)
    try:
        trainer.train(**train_kw)
    except KeyboardInterrupt as e:
        print('Training interrupted by user.')
    finally:
        for env in envs.values():
            env.close()

    # Include final model state
    training_data = trainer.training_data
    training_data['model_state'] = p_net.state_dict()
    return training_data
    
def train_pg(baseline=False, entropy=False, **train_kwargs):
    hp = hw4.answers.part1_pg_hyperparams()
    
    loss_fns = []
    if baseline:
        loss_fns.append(hw4pg.BaselinePolicyGradientLoss())
    else:
        loss_fns.append(hw4pg.VanillaPolicyGradientLoss())
    if entropy:
        loss_fns.append(hw4pg.ActionEntropyLoss(ENV_N_ACTIONS, hp['beta']))

    return train_rl(hw4pg.PolicyAgent, hw4pg.PolicyNet, loss_fns, hp, **train_kwargs)

The PolicyTrainer class implements the training loop, collects the losses and rewards and provides some useful checkpointing functionality. The training loop will generate batches of episodes and train on them until either:

  • The average total reward from the last running_mean_len episodes is greater than the target_reward, OR
  • The number of generated episodes reached max_episodes.
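
The two stopping criteria can be sketched with a fixed-length window of recent episode rewards. The function `should_stop` and its arguments are hypothetical, and requiring a full window before checking the target is an assumption of this sketch rather than something the PolicyTrainer is known to do.

```python
from collections import deque


def should_stop(recent_rewards, n_episodes, target_reward, max_episodes):
    """Return True when either stopping criterion from the list above holds."""
    window_full = len(recent_rewards) == recent_rewards.maxlen
    mean_reward = sum(recent_rewards) / max(len(recent_rewards), 1)
    reached_target = window_full and mean_reward > target_reward
    return reached_target or n_episodes >= max_episodes


running_mean_len = 3  # hypothetical small window
recent = deque(maxlen=running_mean_len)  # keeps only the latest rewards
for episode_reward in [-200.0, -150.0, -130.0, -120.0]:
    recent.append(episode_reward)

# Mean of the last three rewards is (-150 - 130 - 120) / 3, about -133.3.
print(should_stop(recent, n_episodes=4, target_reward=-140, max_episodes=2000))  # True
```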

Most of this class is already implemented for you.

TODO:

  1. Complete the training loop by implementing the train_batch() method of the PolicyTrainer.
  2. Tweak the hyperparameters in the part1_pg_hyperparams() function within the hw4/answers.py module as needed. You get some sane defaults.

Let's check whether our model is actually training. We'll try to reach a very low (bad) target reward, just as a sanity check to see that training works. Your model should be able to reach this target reward within a few batches.

You can increase the target reward and use this block to manually tweak your model and hyperparameters a few times.

In [20]:
target_reward = -140 # VERY LOW target
#target_reward = 0
train_data = train_pg(target_reward=target_reward, seed=SEED, max_episodes=2000, running_mean_len=10)

test.assertGreater(train_data['mean_reward'][-1], target_reward)
hyperparams: {'batch_size': 32, 'gamma': 0.99, 'beta': 0.05, 'learn_rate': 0.0015, 'eps': 1e-07, 'num_workers': 0, 'hidden_dims': 512}
=== Training...
#2: step=00009036, loss_p=-103.80, m_reward(10)=-125.6 (best=-168.7):   5%| | 96

=== 🚀 SOLVED - Target reward reached! 🚀

Experimenting with different losses¶

We'll now run a few experiments to see the effect of different loss functions on the training dynamics. Namely, we'll try:

  1. Vanilla PG (vpg): No baseline, no entropy
  2. Baseline PG (bpg): Baseline, no entropy loss
  3. Entropy PG (epg): No baseline, with entropy loss
  4. Combined PG (cpg): Baseline, with entropy loss
In [21]:
from collections import namedtuple
from pprint import pprint
import itertools as it


ExpConfig = namedtuple('ExpConfig', ('name','baseline','entropy'))

def exp_configs():
    exp_names = ('vpg', 'epg', 'bpg', 'cpg')
    z = zip(exp_names, it.product((False, True), (False, True)))
    return (ExpConfig(n, b, e) for (n, (b, e)) in z)

pprint(list(exp_configs()))
[ExpConfig(name='vpg', baseline=False, entropy=False),
 ExpConfig(name='epg', baseline=False, entropy=True),
 ExpConfig(name='bpg', baseline=True, entropy=False),
 ExpConfig(name='cpg', baseline=True, entropy=True)]

We'll save the training data from each experiment for plotting.

In [22]:
import pickle

def dump_training_data(data, filename):
    os.makedirs(os.path.dirname(filename), exist_ok=True)
    with open(filename, mode='wb') as file:
        pickle.dump(data, file)
        
def load_training_data(filename):
    with open(filename, mode='rb') as file:
        return pickle.load(file)

Let's run the experiments! We'll run each configuration for a fixed number of episodes so that we can compare them.

Notes:

  1. Until your models start working, you can decrease the number of episodes for each experiment, or only run one experiment.
  2. The results will be saved in a file. To re-run the experiments, you can set force_run to True.
In [23]:
import math

exp_max_episodes = 4000

results = {}
training_data_filename = os.path.join('results', 'part1_exp.dat')

# Set to True to force re-run (careful! will delete old experiment results)
force_run = False

# Skip running if results file exists.
if os.path.isfile(training_data_filename) and not force_run:
    print(f'=== results file {training_data_filename} exists, skipping experiments.')
    results = load_training_data(training_data_filename)
    
else:
    for n, b, e in exp_configs():
        print(f'=== Experiment {n}')
        results[n] = train_pg(baseline=b, entropy=e, max_episodes=exp_max_episodes, post_batch_fn=None)
        
    dump_training_data(results, training_data_filename)
=== results file results/part1_exp.dat exists, skipping experiments.
In [24]:
def plot_experiment_results(results, fig=None):
    if fig is None:
        fig, _ = plt.subplots(nrows=2, ncols=2, sharex=True, figsize=(18,12))
    for i, plot_type in enumerate(('loss_p', 'baseline', 'loss_e', 'mean_reward')):
        ax = fig.axes[i]
        for exp_name, exp_res in results.items():
            if plot_type not in exp_res:
                continue
            ax.plot(exp_res['episode_num'], exp_res[plot_type], label=exp_name)
        ax.set_title(plot_type)
        ax.set_xlabel('episode')
        ax.legend()
    return fig
    
experiments_results_fig = plot_experiment_results(results)

You should see positive training dynamics in the graphs (reward going up). If you don't, use them to further update your model or hyperparams.

To pass the test, you'll need to get a best total mean reward of at least 10 in the fixed number of episodes using the combined loss. It's possible to get much higher (over 100).

In [25]:
best_cpg_mean_reward = max(results['cpg']['mean_reward'])
print(f'Best CPG mean reward: {best_cpg_mean_reward:.2f}')

test.assertGreater(best_cpg_mean_reward, 10)
Best CPG mean reward: 91.58

Now let's take a look at a gameplay video of our cpg model after the short training!

In [26]:
hp = hw4.answers.part1_pg_hyperparams()
p_net_cpg = hw4pg.PolicyNet.build_for_env(env, **hp)
p_net_cpg.load_state_dict(results['cpg']['model_state'])

env, n_steps, reward = hw4pg.PolicyAgent.monitor_episode(ENV_NAME, p_net_cpg)
print(f'{n_steps} steps, total reward: {reward:.2f}')
show_monitor_video(env)
300 steps, total reward: 54.55
Out[26]:

Advantage Actor-Critic (AAC)¶

We have seen that the policy-gradient loss can be interpreted as a log-likelihood of the policy term (selecting a specific action at a specific state), weighted by the future rewards of that choice of action.

However, naïvely weighting by rewards has significant drawbacks in terms of the variance of the resulting gradient. We addressed this by adding a simple baseline term which represented our "expected reward" so that we increase probability of actions leading to trajectories which exceed this expectation and vice-versa.

In this part we'll explore a more powerful baseline, which is the idea behind the AAC method.

The advantage function¶

Recall the definition of the state-value function $v_{\pi}(s)$ and action-value function $q_{\pi}(s,a)$:

$$ \begin{align} v_{\pi}(s) &= \E{g(\tau)|s_0 = s,\pi} \\ q_{\pi}(s,a) &= \E{g(\tau)|s_0 = s,a_0=a,\pi}. \end{align} $$

Both these functions represent the value of the state $s$. However, $v_\pi$ averages over the first action according to the policy, while $q_\pi$ fixes the first action and then continues according to the policy.

Their difference is known as the advantage function: $$ a_\pi(s,a) = q_\pi(s,a)-v_\pi(s). $$

If $a_\pi(s,a)>0$ it means that it's better (in expectation) to take action $a$ in state $s$ compared to the average action. In other words, $a_\pi(s,a)$ represents the advantage of using action $a$ in state $s$ compared to the others.
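As a minimal numeric sketch of these definitions, we can compute $v_\pi(s)$ as the policy-weighted average of $q_\pi(s,a)$ and then the advantage of each action. The action values and policy probabilities below are made-up numbers for a single hypothetical state, not values taken from the environment.

```python
# Hypothetical action-values q(s, a) for one state with 4 actions,
# and a hypothetical policy pi(a|s) over those actions.
q_values = [1.0, 3.0, -2.0, 0.5]
policy = [0.1, 0.6, 0.2, 0.1]

# v(s) averages q(s, a) over the first action, according to the policy.
v = sum(p * q for p, q in zip(policy, q_values))

# a(s, a) = q(s, a) - v(s): positive means better than the policy's average.
advantages = [q - v for q in q_values]

# Under the policy itself, the expected advantage is zero.
expected_advantage = sum(p * a for p, a in zip(policy, advantages))

print(f"v(s) = {v:.3f}")
print("advantages =", [round(a, 3) for a in advantages])
```

Only the second action has a positive advantage here, so it is the only one whose probability a policy-gradient step weighted by the advantage would push up.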

So far we have used an estimate for $q_\pi$ as our weighting term for the log-policy, with a fixed baseline per batch.

$$ \hat\grad\mathcal{L}_{\text{BPG}}(\vec{\theta}) =-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\hat{q}_{i,t}-b\right) \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). $$

Now, we will use the state value as a baseline, so that an estimate of the advantage function is our weighting term:

$$ \hat\grad\mathcal{L}_{\text{AAC}}(\vec{\theta}) =-\frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0} \left(\hat{q}_{i,t}-v_\pi(s_{i,t})\right) \grad\log \pi_{\vec{\theta}}(a_{i,t}|s_{i,t}). $$

Intuitively, using the advantage function makes sense because it means we're weighting our policy's actions according to how advantageous they are compared to other possible actions.

But how will we know $v_\pi(s)$? We'll learn it of course, using another neural network. This is known as actor-critic learning. We simultaneously learn the policy (actor) and the value of states (critic). We'll treat it as a regression task: given a state $s_{i,t}$, our state-value network will output $\hat{v}_\pi(s_{i,t})$, an estimate of the actual unknown state-value. Our regression targets will be the discounted rewards, $\hat{q}_{i,t}$ (see question 2), and we can use a simple MSE as the loss function, $$ \mathcal{L}_{\text{SV}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{t\geq0}\left(\hat{v}_\pi(s_{i,t}) - \hat{q}_{i,t}\right)^2. $$
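To make the regression targets concrete, here is a framework-free sketch for a single episode (so the $\frac{1}{N}$ batch averaging is omitted). The rewards, the discount factor `gamma`, the critic outputs `v_hat` and the helper `discounted_returns` are illustrative assumptions; in the assignment these would come from the environment and from your own network and code.

```python
def discounted_returns(rewards, gamma):
    """q_hat[t] = sum_{k>=0} gamma^k * rewards[t+k], computed backwards in time."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [1.0, 0.0, 2.0]   # hypothetical per-step rewards of one episode
gamma = 0.5                 # hypothetical discount factor
v_hat = [1.0, 1.5, 1.0]     # hypothetical critic outputs for each visited state

q_hat = discounted_returns(rewards, gamma)

# Advantage estimates: the weights on grad log pi(a_t|s_t) in the AAC loss.
advantages = [q - v for q, v in zip(q_hat, v_hat)]

# State-value regression loss for this episode: sum of squared errors.
loss_sv = sum((v - q) ** 2 for q, v in zip(q_hat, v_hat))

print(q_hat, advantages, loss_sv)
```

The same `q_hat` values play two roles: they are the regression targets for the critic, and after subtracting the critic's prediction they become the per-step weights for the actor.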

Implementation¶

We'll build heavily on our implementation of the regular policy-gradient method, and just add a new model class and a new loss class, with a small modification to the agent.

Let's start with the model. It will accept a state, and return action scores (as before), but also the value of that state. You can experiment with a dual-head network that has a shared base, or implement two separate parts within the network.

TODO:

  1. Implement the model as the AACPolicyNet class in the hw4/rl_ac.py module.
  2. Set the hyperparameters in the part1_aac_hyperparams() function of the hw4.answers module.
In [27]:
import hw4.rl_ac as hw4ac

hp = hw4.answers.part1_aac_hyperparams()
pv_net = hw4ac.AACPolicyNet.build_for_env(env, device, **hp)
pv_net
Out[27]:
AACPolicyNet(
  (fc): Sequential(
    (0): Linear(in_features=8, out_features=512, bias=True)
    (1): ReLU()
  )
  (policy): Sequential(
    (0): Linear(in_features=512, out_features=4, bias=True)
  )
  (value): Sequential(
    (0): Linear(in_features=512, out_features=1, bias=True)
  )
)

TODO: Complete the implementation of the agent class, AACPolicyAgent, in the hw4/rl_ac.py module.

In [28]:
agent = hw4ac.AACPolicyAgent(env, pv_net, device)
exp = agent.step()

test.assertIsInstance(exp, hw4pg.Experience)
print(exp)
Experience(state=tensor([-0.0066,  1.3987, -0.6635, -0.5428,  0.0076,  0.1503,  0.0000,  0.0000]), action=0, reward=-1.0456538864927154, is_done=False)

TODO: Implement the AAC loss function as the class AACPolicyGradientLoss in the hw4/rl_ac.py module.

In [29]:
loss_fn_aac = hw4ac.AACPolicyGradientLoss(delta=1.)
test_state_values = torch.ones(test_action_scores.shape[0], 1)
loss_t, loss_dict = loss_fn_aac(test_batch, (test_action_scores, test_state_values))

print(f'{loss_dict=}')
test.assertAlmostEqual(loss_dict['adv_m'], -30.841, delta=1e-2)
test.assertAlmostEqual(loss_t.item(), 1466.830, delta=1e-2)
loss_dict={'loss_p': -50.23126021207819, 'loss_v': 1517.0619799854114, 'adv_m': -30.84125724697279}

Experimentation¶

Let's run the same experiment as before, but with the AAC method and compare the results.

In [30]:
def train_aac(baseline=False, entropy=False, **train_kwargs):
    hp = hw4.answers.part1_aac_hyperparams()
    loss_fns = [hw4ac.AACPolicyGradientLoss(hp['delta']), hw4pg.ActionEntropyLoss(ENV_N_ACTIONS, hp['beta'])]
    return train_rl(hw4ac.AACPolicyAgent, hw4ac.AACPolicyNet, loss_fns, hp, **train_kwargs)
In [31]:
training_data_filename = os.path.join('results', f'part1_exp_aac.dat')

# Set to True to force re-run (careful, will delete old experiment results)
force_run = False

if os.path.isfile(training_data_filename) and not force_run:
    print(f'=== results file {training_data_filename} exists, skipping experiments.')
    results_aac = load_training_data(training_data_filename)
    
else:
    print(f'=== Running AAC experiment')
    training_data = train_aac(max_episodes=exp_max_episodes)
    results_aac = dict(aac=training_data)
    dump_training_data(results_aac, training_data_filename)
=== results file results/part1_exp_aac.dat exists, skipping experiments.
In [32]:
experiments_results_fig = plot_experiment_results(results)
plot_experiment_results(results_aac, fig=experiments_results_fig);

You should get better results with the AAC method, so this time the bar is higher (again, you should aim for a mean reward of 100+). Compare the graphs with the combined PG method and see if they make sense.

In [33]:
best_aac_mean_reward = max(results_aac['aac']['mean_reward'])
print(f'Best AAC mean reward: {best_aac_mean_reward:.2f}')

test.assertGreater(best_aac_mean_reward, 50)
Best AAC mean reward: 86.91

Final model training and visualization¶

Now, using your best model and hyperparams, let's train the model for much longer and see the performance. Just for fun, we'll also visualize an episode every now and then so that we can see how well the agent is playing.

TODO:

  • Run the following block to train.
  • Tweak model or hyperparams as necessary.
  • Aim for high mean reward, at least 150+. It's possible to get over 200.
  • When training is done and you're satisfied with the model's outputs, rename the checkpoint file by adding _final to the file name. This will cause the block to skip training and instead load your saved model when running the homework submission script. Note that your submission zip file will not include the checkpoint file. This is OK.
In [34]:
import IPython.display

CHECKPOINTS_FILE = f'checkpoints/{ENV_NAME}-ac.dat'
CHECKPOINTS_FILE_FINAL = f'checkpoints/{ENV_NAME}-ac_final.dat'
TARGET_REWARD = 125
MAX_EPISODES = 15_000

def post_batch_fn(batch_idx, p_net, batch, print_every=20, final=False):
    if not final and batch_idx % print_every != 0:
        return
    env, n_steps, reward = hw4ac.AACPolicyAgent.monitor_episode(ENV_NAME, p_net)
    html = show_monitor_video(env, width="500")
    IPython.display.clear_output(wait=True)
    print(f'Monitor@#{batch_idx}: n_steps={n_steps}, total_reward={reward:.3f}, final={final}')
    IPython.display.display_html(html)
    
    
if os.path.isfile(CHECKPOINTS_FILE_FINAL):
    print(f'=== {CHECKPOINTS_FILE_FINAL} exists, skipping training...')
    checkpoint_data = torch.load(CHECKPOINTS_FILE_FINAL)
    hp = hw4.answers.part1_aac_hyperparams()
    pv_net = hw4ac.AACPolicyNet.build_for_env(env, **hp)
    pv_net.load_state_dict(checkpoint_data['params'])
    print(f'=== Running best model...')
    env, n_steps, reward = hw4ac.AACPolicyAgent.monitor_episode(ENV_NAME, pv_net)
    print(f'=== Best model ran for {n_steps} steps. Total reward: {reward:.2f}')
    IPython.display.display_html(show_monitor_video(env))
    best_mean_reward = checkpoint_data["best_mean_reward"]
else:
    print(f'=== Starting training...')
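    # Pass the target by keyword: the first positional parameter of train_aac is `baseline`.
    train_data = train_aac(target_reward=TARGET_REWARD, max_episodes=MAX_EPISODES,
                           seed=None, checkpoints_file=CHECKPOINTS_FILE, post_batch_fn=post_batch_fn)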
    print(f'=== Done, ', end='')
    best_mean_reward = train_data["best_mean_reward"][-1]
    print(f'num_episodes={train_data["episode_num"][-1]}, best_mean_reward={best_mean_reward:.1f}')
          
test.assertGreaterEqual(best_mean_reward, TARGET_REWARD)
Monitor@#1120: n_steps=295, total_reward=289.778, final=False
#1140: step=03834558, loss_p= -7.20, loss_v= 15.31, adv_m=-13.63, loss_e= -0.00,
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Input In [34], in <module>
     29 else:
     30     print(f'=== Starting training...')
---> 31     train_data = train_aac(TARGET_REWARD, max_episodes=MAX_EPISODES,
     32                            seed=None, checkpoints_file=CHECKPOINTS_FILE, post_batch_fn=post_batch_fn)
     33     print(f'=== Done, ', end='')
     34     best_mean_reward = train_data["best_mean_reward"][-1]

Input In [30], in train_aac(baseline, entropy, **train_kwargs)
      2 hp = hw4.answers.part1_aac_hyperparams()
      3 loss_fns = [hw4ac.AACPolicyGradientLoss(hp['delta']), hw4pg.ActionEntropyLoss(ENV_N_ACTIONS, hp['beta'])]
----> 4 return train_rl(hw4ac.AACPolicyAgent, hw4ac.AACPolicyNet, loss_fns, hp, **train_kwargs)

Input In [19], in train_rl(agent_type, net_type, loss_fns, hp, seed, checkpoints_file, **train_kw)
     35 trainer = hw4pg.PolicyTrainer(p_net, optimizer, loss_fns, dataloader, checkpoints_file)
     36 try:
---> 37     trainer.train(**train_kw)
     38 except KeyboardInterrupt as e:
     39     print('Training interrupted by user.')

File ~/Documents/236781/hw3/Deep_Learning_CS/hw4/hw4/rl_pg.py:420, in PolicyTrainer.train(self, target_reward, running_mean_len, max_episodes, post_batch_fn)
    418 if episode_num >= max_episodes:
    419     terminate = f"\n=== STOPPING - Max episode reached"
--> 420 post_batch_fn(i, self.model, batch, final=terminate is not None)
    421 if terminate:
    422     break

Input In [34], in post_batch_fn(batch_idx, p_net, batch, print_every, final)
      9 if not final and batch_idx % print_every != 0:
     10     return
---> 11 env, n_steps, reward = hw4ac.AACPolicyAgent.monitor_episode(ENV_NAME, p_net)
     12 html = show_monitor_video(env, width="500")
     13 IPython.display.clear_output(wait=True)

File ~/Documents/236781/hw3/Deep_Learning_CS/hw4/hw4/rl_pg.py:155, in PolicyAgent.monitor_episode(cls, env_name, p_net, monitor_dir, device)
    146 n_steps, reward = 0, 0.0
    147 with gym.wrappers.Monitor(
    148     gym.make(env_name), monitor_dir, video_callable=None, force=True
    149 ) as env:
   (...)
    153     # ====== YOUR CODE: ======
    154     #agent = PolicyAgent(env, p_net, device)
--> 155     agent = cls(env, p_net, device)
    156     is_done = False
    157     n_steps = 0

File ~/Documents/236781/hw3/Deep_Learning_CS/hw4/hw4/rl_pg.py:79, in PolicyAgent.__init__(self, env, p_net, device)
     77 self.curr_state = None
     78 self.curr_episode_reward = None
---> 79 self.reset()

File ~/Documents/236781/hw3/Deep_Learning_CS/hw4/hw4/rl_pg.py:83, in PolicyAgent.reset(self)
     81 def reset(self):
     82     self.curr_state = torch.tensor(
---> 83         self.env.reset(), device=self.device, dtype=torch.float
     84     )
     85     self.curr_episode_reward = 0.0

File ~/miniconda3/envs/cs236781-hw4/lib/python3.8/site-packages/gym/wrappers/monitor.py:56, in Monitor.reset(self, **kwargs)
     54 self._before_reset()
     55 observation = self.env.reset(**kwargs)
---> 56 self._after_reset(observation)
     58 return observation

File ~/miniconda3/envs/cs236781-hw4/lib/python3.8/site-packages/gym/wrappers/monitor.py:241, in Monitor._after_reset(self, observation)
    238 # Reset the stat count
    239 self.stats_recorder.after_reset(observation)
--> 241 self.reset_video_recorder()
    243 # Bump *after* all reset activity has finished
    244 self.episode_id += 1

File ~/miniconda3/envs/cs236781-hw4/lib/python3.8/site-packages/gym/wrappers/monitor.py:267, in Monitor.reset_video_recorder(self)
    253 # Start recording the next video.
    254 #
    255 # TODO: calculate a more correct 'episode_id' upon merge
    256 self.video_recorder = video_recorder.VideoRecorder(
    257     env=self.env,
    258     base_path=os.path.join(
   (...)
    265     enabled=self._video_enabled(),
    266 )
--> 267 self.video_recorder.capture_frame()

File ~/miniconda3/envs/cs236781-hw4/lib/python3.8/site-packages/gym/wrappers/monitoring/video_recorder.py:132, in VideoRecorder.capture_frame(self)
    129 logger.debug("Capturing video frame: path=%s", self.path)
    131 render_mode = "ansi" if self.ansi_mode else "rgb_array"
--> 132 frame = self.env.render(mode=render_mode)
    134 if frame is None:
    135     if self._async:

File ~/miniconda3/envs/cs236781-hw4/lib/python3.8/site-packages/gym/core.py:295, in Wrapper.render(self, mode, **kwargs)
    294 def render(self, mode="human", **kwargs):
--> 295     return self.env.render(mode, **kwargs)

File ~/miniconda3/envs/cs236781-hw4/lib/python3.8/site-packages/gym/envs/box2d/lunar_lander.py:391, in LunarLander.render(self, mode)
    388 from gym.envs.classic_control import rendering
    390 if self.viewer is None:
--> 391     self.viewer = rendering.Viewer(VIEWPORT_W, VIEWPORT_H)
    392     self.viewer.set_bounds(0, VIEWPORT_W / SCALE, 0, VIEWPORT_H / SCALE)
    394 for obj in self.particles:

File ~/miniconda3/envs/cs236781-hw4/lib/python3.8/site-packages/gym/envs/classic_control/rendering.py:88, in Viewer.__init__(self, width, height, display)
     86 self.width = width
     87 self.height = height
---> 88 self.window = get_window(width=width, height=height, display=display)
     89 self.window.on_close = self.window_closed_by_user
     90 self.isopen = True

File ~/miniconda3/envs/cs236781-hw4/lib/python3.8/site-packages/gym/envs/classic_control/rendering.py:69, in get_window(width, height, display, **kwargs)
     65 """
     66 Will create a pyglet window from the display specification provided.
     67 """
     68 screen = display.get_screens()  # available screens
---> 69 config = screen[0].get_best_config()  # selecting the first screen
     70 context = config.create_context(None)  # create GL context
     72 return pyglet.window.Window(
     73     width=width,
     74     height=height,
   (...)
     78     **kwargs
     79 )

IndexError: list index out of range

Questions¶

TODO: Answer the following questions. Write your answers in the appropriate variables in the module hw4/answers.py.

In [ ]:
from cs236781.answers import display_answer
import hw4.answers

Question 1¶

Explain qualitatively why subtracting a baseline in the policy-gradient helps reduce its variance. Specifically, give an example where it helps.

In [ ]:
display_answer(hw4.answers.part1_q1)

Question 2¶

In AAC, when using the estimated q-values as regression targets for our state-values, why do we get a valid approximation? Hint: how is $v_\pi(s)$ expressed in terms of $q_\pi(s,a)$?

In [ ]:
display_answer(hw4.answers.part1_q2)

Question 3¶

  1. Analyze and explain the graphs you got in the first experiment run.
  2. Compare the experiment graphs you got with the AAC method to the regular PG method (cpg).
In [ ]:
display_answer(hw4.answers.part1_q3)
$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bm}[1]{{\bf #1}} \newcommand{\bb}[1]{\bm{\mathrm{#1}}} $$

Part 2: Variational Autoencoder¶

In this part we will learn to generate new data using a special type of autoencoder model which allows us to sample from its latent space. We'll implement and train a VAE and use it to generate new images.

In [1]:
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re
import zipfile

import numpy as np
import torch
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2
In [2]:
test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cuda

Obtaining the dataset¶

Let's begin by downloading a dataset of images that we want to learn to generate. We'll use the Labeled Faces in the Wild (LFW) dataset which contains many labeled faces of famous individuals.

We're going to train our generative model to generate a specific face, not just any face. Since the person with the most images in this dataset is former president George W. Bush, we'll set out to train a Bush Generator :)

However, if you feel adventurous and/or prefer to generate something else, feel free to edit the PART2_CUSTOM_DATA_URL variable in hw4/answers.py.

In [3]:
import cs236781.plot as plot
import cs236781.download
from hw4.answers import PART2_CUSTOM_DATA_URL as CUSTOM_DATA_URL

DATA_DIR = pathlib.Path.home().joinpath('.pytorch-datasets')
if CUSTOM_DATA_URL is None:
    DATA_URL = 'http://vis-www.cs.umass.edu/lfw/lfw-bush.zip'
else:
    DATA_URL = CUSTOM_DATA_URL

_, dataset_dir = cs236781.download.download_data(out_path=DATA_DIR, url=DATA_URL, extract=True, force=False)
File /home/rudman/.pytorch-datasets/lfw-bush.zip exists, skipping download.
Extracting /home/rudman/.pytorch-datasets/lfw-bush.zip...
Extracted 531 to /home/rudman/.pytorch-datasets/lfw/George_W_Bush

Create a Dataset object that will load the extracted images:

In [4]:
import torchvision.transforms as T
from torchvision.datasets import ImageFolder

im_size = 64
tf = T.Compose([
    # Resize to constant spatial dimensions
    T.Resize((im_size, im_size)),
    # PIL.Image -> torch.Tensor
    T.ToTensor(),
    # Dynamic range [0,1] -> [-1, 1]
    T.Normalize(mean=(.5,.5,.5), std=(.5,.5,.5)),
])

ds_gwb = ImageFolder(os.path.dirname(dataset_dir), tf)

OK, let's see what we got. You can run the following block multiple times to display a random subset of images from the dataset.

In [5]:
_ = plot.dataset_first_n(ds_gwb, 50, figsize=(15,10), nrows=5)
print(f'Found {len(ds_gwb)} images in dataset folder.')
Found 530 images in dataset folder.
In [6]:
x0, y0 = ds_gwb[0]
x0 = x0.unsqueeze(0).to(device)
print(x0.shape)

test.assertSequenceEqual(x0.shape, (1, 3, im_size, im_size))
torch.Size([1, 3, 64, 64])

The Variational Autoencoder¶

An autoencoder is a model which learns a representation of data in an unsupervised fashion (i.e. without any labels). Recall its general form from the lecture:

An autoencoder maps an instance $\bb{x}$ to a latent-space representation $\bb{z}$. It has an encoder part, $\Phi_{\bb{\alpha}}(\bb{x})$ (a model with parameters $\bb{\alpha}$) and a decoder part, $\Psi_{\bb{\beta}}(\bb{z})$ (a model with parameters $\bb{\beta}$).

While autoencoders can learn useful representations, generally it's hard to use them as generative models because there's no distribution we can sample from in the latent space. In other words, we have no way to choose a point $\bb{z}$ in the latent space such that $\Psi(\bb{z})$ will end up on the data manifold in the instance space.

The variational autoencoder (VAE), first proposed by Kingma and Welling, addresses this issue by taking a probabilistic perspective. Briefly, a VAE model can be described as follows.

We define, in Bayesian terminology,

  • The prior distribution $p(\bb{Z})$ on points in the latent space.
  • The posterior distribution of points in the latent space given a specific instance: $p(\bb{Z}|\bb{X})$.
  • The likelihood distribution of a sample $\bb{X}$ given a latent-space representation: $p(\bb{X}|\bb{Z})$.
  • The evidence distribution $p(\bb{X})$ which is the distribution of the instance space due to the generative process.

To create our variational decoder we'll further specify:

  • A parametric likelihood distribution, $p _{\bb{\beta}}(\bb{X} | \bb{Z}=\bb{z}) = \mathcal{N}( \Psi _{\bb{\beta}}(\bb{z}) , \sigma^2 \bb{I} )$. The interpretation is that given a latent $\bb{z}$, we map it to a point normally distributed around the point calculated by our decoder neural network. Note that here $\sigma^2$ is a hyperparameter while $\vec{\beta}$ represents the network parameters.
  • A fixed latent-space prior distribution of $p(\bb{Z}) = \mathcal{N}(\bb{0},\bb{I})$.

This setting allows us to generate a new instance $\bb{x}$ by sampling $\bb{z}$ from the multivariate normal distribution, obtaining the instance-space mean $\Psi _{\bb{\beta}}(\bb{z})$ using our decoder network, and then sampling $\bb{x}$ from $\mathcal{N}( \Psi _{\bb{\beta}}(\bb{z}) , \sigma^2 \bb{I} )$.
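The three generation steps can be sketched without any deep learning framework. Here `toy_decoder` is a hypothetical stand-in for the decoder network $\Psi _{\bb{\beta}}$, and the latent size, instance size and `sigma` are arbitrary illustrative values.

```python
import random

def toy_decoder(z):
    """Hypothetical stand-in for Psi_beta(z): maps a 2-d latent to a 3-d 'instance'."""
    return [z[0] + z[1], z[0] - z[1], 0.5 * z[0]]

def generate(rng, z_dim=2, sigma=0.1):
    # 1. Sample z from the prior N(0, I).
    z = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]
    # 2. Decode z to get the instance-space mean Psi(z).
    mean = toy_decoder(z)
    # 3. Sample x from N(Psi(z), sigma^2 I).
    x = [rng.gauss(m, sigma) for m in mean]
    return z, mean, x

rng = random.Random(0)
z, mean, x = generate(rng)
print("z =", z)
print("x =", x)
```

With `sigma=0.0` the last step returns the decoder mean itself, which shows that $\sigma^2$ only controls how much noise is added around the decoded point.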

Our variational encoder will approximate the posterior with a parametric distribution $q _{\bb{\alpha}}(\bb{Z} | \bb{x}) = \mathcal{N}( \bb{\mu} _{\bb{\alpha}}(\bb{x}), \mathrm{diag}\{ \bb{\sigma}^2_{\bb{\alpha}}(\bb{x}) \} )$. The interpretation is that our encoder model, $\Phi_{\vec{\alpha}}(\bb{x})$, calculates the mean and variance of the posterior distribution, and samples $\bb{z}$ based on them. An important nuance here is that our network can't contain any stochastic elements that depend on the model parameters, otherwise we won't be able to back-propagate to those parameters. So sampling $\bb{z}$ from $\mathcal{N}( \bb{\mu} _{\bb{\alpha}}(\bb{x}), \mathrm{diag}\{ \bb{\sigma}^2_{\bb{\alpha}}(\bb{x}) \} )$ is not an option. The solution is to use what's known as the reparametrization trick: sample from an isotropic Gaussian, i.e. $\bb{u}\sim\mathcal{N}(\bb{0},\bb{I})$ (which doesn't depend on trainable parameters), and calculate the latent representation as $\bb{z} = \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{u}\odot\bb{\sigma}_{\bb{\alpha}}(\bb{x})$.
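A small standard-library sketch of the reparametrization trick follows. The values of `mu` and `sigma` are hypothetical encoder outputs and `reparametrize` is an illustrative helper, not part of the assignment code. All randomness comes from `u`, so `z` is a deterministic function of `mu` and `sigma` once `u` is drawn.

```python
import random
import statistics

def reparametrize(mu, sigma, rng):
    """z = mu + u * sigma (element-wise), with u ~ N(0, I) independent of the parameters."""
    u = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + ui * s for m, ui, s in zip(mu, u, sigma)]

rng = random.Random(42)
mu = [-0.4, 0.3]      # hypothetical encoder mean
sigma = [1.0, 0.5]    # hypothetical encoder standard deviation

# Empirically, the samples follow N(mu, diag(sigma^2)).
samples = [reparametrize(mu, sigma, rng) for _ in range(10000)]
sample_mean = [statistics.fmean(s[i] for s in samples) for i in range(2)]
sample_std = [statistics.pstdev(s[i] for s in samples) for i in range(2)]

print("sample mean:", sample_mean)
print("sample std: ", sample_std)
```

Because $\bb{z}$ is an affine function of $\bb{\mu}$ and $\bb{\sigma}$ for a fixed $\bb{u}$, its derivatives are simply $\partial z_j/\partial \mu_j = 1$ and $\partial z_j/\partial \sigma_j = u_j$, which is what lets gradients flow back into the encoder parameters.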

To train a VAE model, we maximize the evidence distribution, $p(\bb{X})$ (see question below). The VAE loss can therefore be stated as minimizing $\mathcal{L} = -\mathbb{E}_{\bb{x}} \log p(\bb{X})$. Although this expectation is intractable, we can obtain a lower bound for $\log p(\bb{X})$ (the evidence lower bound, "ELBO", shown in the lecture):

$$ \log p(\bb{X}) \ge \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} }\left[ \log p _{\bb{\beta}}(\bb{X} | \bb{z}) \right] - \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{X})\,\left\|\, p(\bb{Z} )\right.\right) $$

where $ \mathcal{D} _{\mathrm{KL}}(q\left\|\right.p) = \mathbb{E}_{\bb{z}\sim q}\left[ \log \frac{q(\bb{Z})}{p(\bb{Z})} \right] $ is the Kullback-Leibler divergence, which can be interpreted as the information gained by using the posterior $q(\bb{Z|X})$ instead of the prior distribution $p(\bb{Z})$.
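For a diagonal Gaussian $q$ and a standard normal prior, this divergence has the well-known closed form $\frac{1}{2}\sum_j\left(\sigma_j^2 + \mu_j^2 - 1 - \log\sigma_j^2\right)$. The sketch below checks it against a Monte Carlo estimate of the expectation in the definition. The function names and the numeric values of `mu` and `log_sigma2` are illustrative assumptions.

```python
import math
import random

def kl_to_standard_normal(mu, log_sigma2):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(
        math.exp(ls2) + m * m - 1.0 - ls2 for m, ls2 in zip(mu, log_sigma2)
    )

def kl_monte_carlo(mu, log_sigma2, n_samples, rng):
    """Estimate E_{z~q}[log q(z) - log p(z)] by sampling z from q."""
    total = 0.0
    for _ in range(n_samples):
        for m, ls2 in zip(mu, log_sigma2):
            s = math.exp(0.5 * ls2)
            z = rng.gauss(m, s)
            log_q = -0.5 * (math.log(2 * math.pi) + ls2 + ((z - m) / s) ** 2)
            log_p = -0.5 * (math.log(2 * math.pi) + z * z)
            total += log_q - log_p
    return total / n_samples

mu, log_sigma2 = [-0.4, 0.3], [0.2, -0.5]   # hypothetical encoder outputs
exact = kl_to_standard_normal(mu, log_sigma2)
estimate = kl_monte_carlo(mu, log_sigma2, 20000, random.Random(0))

print(f"closed form: {exact:.4f}, Monte Carlo: {estimate:.4f}")
```

The divergence is zero exactly when $\bb{\mu}=\bb{0}$ and $\bb{\sigma}^2=\bb{1}$, i.e. when the approximate posterior coincides with the prior.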

Using the ELBO, the VAE loss becomes, $$ \mathcal{L}(\vec{\alpha},\vec{\beta}) = \mathbb{E} _{\bb{x}} \left[ \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} }\left[ -\log p _{\bb{\beta}}(\bb{x} | \bb{z}) \right] + \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{x})\,\left\|\, p(\bb{Z} )\right.\right) \right]. $$

By remembering that the likelihood is a Gaussian distribution with a diagonal covariance and by applying the reparametrization trick, we can write the above as

$$ \mathcal{L}(\vec{\alpha},\vec{\beta}) = \mathbb{E} _{\bb{x}} \left[ \mathbb{E} _{\bb{z} \sim q _{\bb{\alpha}} } \left[ \frac{1}{2\sigma^2}\left\| \bb{x}- \Psi _{\bb{\beta}}\left( \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{\Sigma}^{\frac{1}{2}} _{\bb{\alpha}}(\bb{x}) \bb{u} \right) \right\| _2^2 \right] + \mathcal{D} _{\mathrm{KL}}\left(q _{\bb{\alpha}}(\bb{Z} | \bb{x})\,\left\|\, p(\bb{Z} )\right.\right) \right]. $$
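The squared-error data term follows directly from the Gaussian density. As a short worked step, writing $d$ for the dimension of the instance space (a symbol introduced here only for illustration):

$$ -\log p _{\bb{\beta}}(\bb{x} | \bb{z}) = -\log \mathcal{N}\left(\bb{x};\, \Psi _{\bb{\beta}}(\bb{z}),\, \sigma^2\bb{I}\right) = \frac{1}{2\sigma^2}\left\| \bb{x} - \Psi _{\bb{\beta}}(\bb{z}) \right\| _2^2 + \frac{d}{2}\log\left(2\pi\sigma^2\right). $$

The last term does not depend on $\vec{\alpha}$ or $\vec{\beta}$, so it can be dropped without changing the minimizer. Substituting the reparametrized sample $\bb{z} = \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{\Sigma}^{\frac{1}{2}} _{\bb{\alpha}}(\bb{x}) \bb{u}$ then gives the data term in the expression above.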

Model Implementation¶

Obviously our model will have two parts, an encoder and a decoder. Since we're working with images, we'll implement both as deep convolutional networks, where the decoder is a "mirror image" of the encoder implemented with adjoint (AKA transposed) convolutions. Between the encoder CNN and the decoder CNN we'll implement the sampling from the parametric posterior approximator $q_{\bb{\alpha}}(\bb{Z}|\bb{x})$ to make it a VAE model and not just a regular autoencoder (of course, this is not yet enough to create a VAE, since we also need a special loss function which we'll get to later).

First let's implement just the CNN part of the Encoder network (this is not the full $\Phi_{\vec{\alpha}}(\bb{x})$ yet). As usual, it should take an input image and map it to an activation volume of a specified depth. We'll consider this volume as the features we extract from the input image. Later we'll use these to create the latent space representation of the input.

TODO: Implement the EncoderCNN class in the hw4/autoencoder.py module. Implement any CNN architecture you like. If you need "architecture inspiration" you can see e.g. this or this paper.

In [7]:
import hw4.autoencoder as autoencoder

in_channels = 3
out_channels = 1024
encoder_cnn = autoencoder.EncoderCNN(in_channels, out_channels).to(device)
print(encoder_cnn)

h = encoder_cnn(x0)
print(h.shape)

test.assertEqual(h.dim(), 4)
test.assertSequenceEqual(h.shape[0:2], (1, out_channels))
EncoderCNN(
  (cnn): Sequential(
    (0): Conv2d(3, 32, kernel_size=(4, 4), stride=(2, 2))
    (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): LeakyReLU(negative_slope=0.2)
    (3): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
    (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): LeakyReLU(negative_slope=0.2)
    (6): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2))
    (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (8): LeakyReLU(negative_slope=0.2)
    (9): Conv2d(128, 1024, kernel_size=(4, 4), stride=(2, 2))
  )
)
torch.Size([1, 1024, 2, 2])

Now let's implement the CNN part of the Decoder. Again this is not yet the full $\Psi _{\bb{\beta}}(\bb{z})$. It should take an activation volume produced by your EncoderCNN and output an image of the same dimensions as the Encoder's input was. This can be a CNN which is like a "mirror image" of the Encoder. For example, replace convolutions with transposed convolutions, downsampling with up-sampling etc. Consult the documentation of ConvTranspose2d to figure out how to reverse your convolutional layers in terms of input and output dimensions. Note that the decoder doesn't have to be exactly the opposite of the encoder and you can experiment with using a different architecture.
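When planning the layers, it can help to trace spatial sizes by hand. The sketch below applies the output-size formulas from the PyTorch documentation of Conv2d and ConvTranspose2d along one spatial dimension. The helper names are hypothetical, and the layer settings are copied from the example encoder and decoder printed in this notebook, which are just one possible architecture.

```python
def conv2d_out(size, kernel, stride=1, padding=0, dilation=1):
    """Output spatial size of a Conv2d layer along one dimension."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

def conv_transpose2d_out(size, kernel, stride=1, padding=0,
                         output_padding=0, dilation=1):
    """Output spatial size of a ConvTranspose2d layer along one dimension."""
    return ((size - 1) * stride - 2 * padding
            + dilation * (kernel - 1) + output_padding + 1)

# Example encoder: four Conv2d(kernel_size=4, stride=2) layers, no padding.
size = 64
encoder_sizes = [size]
for _ in range(4):
    size = conv2d_out(size, kernel=4, stride=2)
    encoder_sizes.append(size)

# Example decoder: five ConvTranspose2d(kernel_size=4, stride=2, padding=1) layers.
decoder_sizes = [size]
for _ in range(5):
    size = conv_transpose2d_out(size, kernel=4, stride=2, padding=1)
    decoder_sizes.append(size)

print(encoder_sizes)  # [64, 31, 14, 6, 2]
print(decoder_sizes)  # [2, 4, 8, 16, 32, 64]
```

With `kernel=4, stride=2, padding=1`, each transposed convolution exactly doubles the spatial size, which is why five of them bring the 2x2 feature map back to 64x64 even though the encoder used only four layers.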

TODO: Implement the DecoderCNN class in the hw4/autoencoder.py module.

In [8]:
decoder_cnn = autoencoder.DecoderCNN(in_channels=out_channels, out_channels=in_channels).to(device)
print(decoder_cnn)
x0r = decoder_cnn(h)
print(x0r.shape)

test.assertEqual(x0.shape, x0r.shape)

# Should look like colored noise
T.functional.to_pil_image(x0r[0].cpu().detach())
DecoderCNN(
  (cnn): Sequential(
    (0): ConvTranspose2d(1024, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU()
    (6): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (7): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (8): ReLU()
    (9): ConvTranspose2d(64, 32, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (10): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (11): ReLU()
    (12): ConvTranspose2d(32, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
  )
)
torch.Size([1, 3, 64, 64])
Out[8]:

Let's now implement the full VAE Encoder, $\Phi_{\vec{\alpha}}(\vec{x})$. It will work as follows:

  1. Produce a feature vector $\vec{h}$ from the input image $\vec{x}$.
  2. Use two affine transforms to convert the features into the mean and log-variance of the posterior, i.e. $$ \begin{align}
     \bb{\mu} _{\bb{\alpha}}(\bb{x}) &= \vec{h}\mattr{W}_{\mathrm{h\mu}} + \vec{b}_{\mathrm{h\mu}} \\
     \log\left(\bb{\sigma}^2_{\bb{\alpha}}(\bb{x})\right) &= \vec{h}\mattr{W}_{\mathrm{h\sigma^2}} + \vec{b}_{\mathrm{h\sigma^2}}
     \end{align} $$
  3. Use the reparametrization trick to create the latent representation $\vec{z}$.

Notice that we model the log of the variance, not the actual variance. The above formulation is proposed in appendix C of the VAE paper.
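One practical reason for modeling the log-variance is that an affine layer can output any real number, while a variance must be positive. A tiny sketch of the conversion, with made-up layer outputs and a hypothetical helper name:

```python
import math

def sigma_from_log_var(log_sigma2):
    """sigma = sqrt(exp(log sigma^2)) = exp(0.5 * log sigma^2), always positive."""
    return [math.exp(0.5 * v) for v in log_sigma2]

# Any real-valued output is a valid log-variance, including negative values.
log_sigma2 = [-4.0, 0.0, 2.0]   # hypothetical outputs of the affine layer
sigma = sigma_from_log_var(log_sigma2)
sigma2 = [s * s for s in sigma]

print("sigma  =", sigma)
print("sigma2 =", sigma2)
```

The resulting `sigma` is the per-dimension standard deviation that multiplies the noise in the reparametrization step.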

TODO: Implement the encode() method in the VAE class within the hw4/autoencoder.py module. You'll also need to define your parameters in __init__().

In [9]:
z_dim = 2
vae = autoencoder.VAE(encoder_cnn, decoder_cnn, x0[0].size(), z_dim).to(device)
print(vae)

z, mu, log_sigma2 = vae.encode(x0)

test.assertSequenceEqual(z.shape, (1, z_dim))
test.assertTrue(z.shape == mu.shape == log_sigma2.shape)

print(f'mu(x0)={list(*mu.detach().cpu().numpy())}, sigma2(x0)={list(*torch.exp(log_sigma2).detach().cpu().numpy())}')
VAE(
  (features_encoder): EncoderCNN(
    (cnn): Sequential(
      (0): Conv2d(3, 32, kernel_size=(4, 4), stride=(2, 2))
      (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): LeakyReLU(negative_slope=0.2)
      (3): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): LeakyReLU(negative_slope=0.2)
      (6): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2))
      (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): LeakyReLU(negative_slope=0.2)
      (9): Conv2d(128, 1024, kernel_size=(4, 4), stride=(2, 2))
    )
  )
  (features_decoder): DecoderCNN(
    (cnn): Sequential(
      (0): ConvTranspose2d(1024, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
      (3): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU()
      (6): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (7): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): ReLU()
      (9): ConvTranspose2d(64, 32, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (10): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (11): ReLU()
      (12): ConvTranspose2d(32, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
  )
  (mu_layer): Sequential(
    (0): Linear(in_features=4096, out_features=2, bias=True)
  )
  (log_sigma2_layer): Sequential(
    (0): Linear(in_features=4096, out_features=2, bias=True)
  )
  (decoder_in): Sequential(
    (0): Linear(in_features=2, out_features=4096, bias=True)
  )
)
mu(x0)=[-0.4129818, 0.2801996], sigma2(x0)=[1.0662273, 0.9347149]

Let's sample some 2d latent representations for an input image x0 and visualize them.

In [10]:
# Sample from q(Z|x)
N = 500
Z = torch.zeros(N, z_dim)
_, ax = plt.subplots()
with torch.no_grad():
    for i in range(N):
        Z[i], _, _ = vae.encode(x0)
        ax.scatter(*Z[i].cpu().numpy())

# Should be close to the mu/sigma2 printed in the previous block
print('sampled mu', torch.mean(Z, dim=0))
print('sampled sigma2', torch.var(Z, dim=0))
sampled mu tensor([-0.4850,  0.3184])
sampled sigma2 tensor([0.8914, 0.8104])

Let's now implement the full VAE Decoder, $\Psi _{\bb{\beta}}(\bb{z})$. It will work as follows:

  1. Produce a feature vector $\tilde{\vec{h}}$ from the latent vector $\vec{z}$ using an affine transform.
  2. Reconstruct an image $\tilde{\vec{x}}$ from $\tilde{\vec{h}}$ using the decoder CNN.

TODO: Implement the decode() method in the VAE class within the hw4/autoencoder.py module. You'll also need to define your parameters in __init__(). You may need to also re-run the block above after you implement this.
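
The two decoder steps can be sketched with a toy module. This is only an illustration under stated assumptions: `TinyDecoder`, its constructor arguments, and the single transposed convolution standing in for the decoder CNN are all hypothetical and are not the assignment's `VAE` class.

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Hypothetical stand-in that mirrors the two decoding steps."""

    def __init__(self, z_dim: int, features_shape: tuple, out_channels: int):
        super().__init__()
        self.features_shape = features_shape  # (C, H, W) expected by the CNN
        c, h, w = features_shape
        # Step 1: affine transform from the latent vector to a flat feature vector
        self.decoder_in = nn.Linear(z_dim, c * h * w)
        # Step 2: one transposed convolution stands in for the full decoder CNN
        self.cnn = nn.ConvTranspose2d(c, out_channels, kernel_size=4, stride=2, padding=1)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.decoder_in(z)                # shape (N, C*H*W)
        h = h.view(-1, *self.features_shape)  # shape (N, C, H, W)
        return self.cnn(h)                    # spatial size is doubled by this layer

decoder = TinyDecoder(z_dim=2, features_shape=(8, 4, 4), out_channels=3)
z_demo = torch.randn(5, 2)
x_rec = decoder(z_demo)
print(x_rec.shape)
```

The reshape between the affine layer and the CNN is the step that is easiest to get wrong: the flat feature vector must be viewed with the same channel and spatial dimensions that the encoder CNN produced.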

In [11]:
x0r = vae.decode(z)

test.assertSequenceEqual(x0r.shape, x0.shape)

Our model's forward() function will simply return decode(encode(x)) as well as the calculated mean and log-variance of the posterior.

In [12]:
x0r, mu, log_sigma2 = vae(x0)

test.assertSequenceEqual(x0r.shape, x0.shape)
test.assertSequenceEqual(mu.shape, (1, z_dim))
test.assertSequenceEqual(log_sigma2.shape, (1, z_dim))
T.functional.to_pil_image(x0r[0].detach().cpu())
Out[12]:

Loss Implementation¶

In practice, since we're using SGD, we'll drop the expectation over $\bb{X}$ and instead sample an instance from the training set and compute a point-wise loss. Similarly, we'll drop the expectation over $\bb{Z}$ by sampling from $q_{\vec{\alpha}}(\bb{Z}|\bb{x})$. Additionally, because the KL divergence is between two Gaussian distributions, there is a closed-form expression for it. These points bring us to the following point-wise loss:

$$ \ell(\vec{\alpha},\vec{\beta};\bb{x}) = \frac{1}{\sigma^2 d_x} \left\| \bb{x}- \Psi _{\bb{\beta}}\left( \bb{\mu} _{\bb{\alpha}}(\bb{x}) + \bb{\Sigma}^{\frac{1}{2}} _{\bb{\alpha}}(\bb{x}) \bb{u} \right) \right\| _2^2 + \mathrm{tr}\,\bb{\Sigma} _{\bb{\alpha}}(\bb{x}) + \|\bb{\mu} _{\bb{\alpha}}(\bb{x})\|^2 _2 - d_z - \log\det \bb{\Sigma} _{\bb{\alpha}}(\bb{x}), $$

where $d_z$ is the dimension of the latent space, $d_x$ is the dimension of the input and $\bb{u}\sim\mathcal{N}(\bb{0},\bb{I})$. This pointwise loss is the quantity that we'll compute and minimize with gradient descent. The first term corresponds to the data-reconstruction loss, while the remaining terms together correspond to the KL-divergence loss. Note that the scaling by $d_x$ is not derived from the original loss formula and was added directly to the pointwise loss just to normalize the data term.
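
For a diagonal $\bb{\Sigma} _{\bb{\alpha}}(\bb{x})$ parametrized by its log-variance, the trace is the sum of $\exp(\log\sigma^2_i)$ and the log-determinant is the sum of $\log\sigma^2_i$. The sketch below evaluates the pointwise loss under that assumption. The function name `pointwise_vae_loss`, the averaging over the batch, and the choice to return the data and KL terms separately are assumptions for illustration, so it is not guaranteed to reproduce the reference value in the test cell.

```python
import torch

def pointwise_vae_loss(x, xr, z_mu, z_log_sigma2, x_sigma2):
    """Hypothetical sketch of the pointwise loss, averaged over the batch."""
    n = x.shape[0]
    d_x = x[0].numel()   # input dimension
    d_z = z_mu.shape[1]  # latent dimension
    # Data term: squared reconstruction error scaled by 1 / (sigma^2 * d_x)
    sq_err = ((x - xr) ** 2).reshape(n, -1).sum(dim=1)
    data_loss = sq_err / (x_sigma2 * d_x)
    # KL term for a diagonal Sigma: tr(Sigma) + ||mu||^2 - d_z - log det(Sigma)
    kldiv_loss = (
        torch.exp(z_log_sigma2).sum(dim=1)
        + (z_mu ** 2).sum(dim=1)
        - d_z
        - z_log_sigma2.sum(dim=1)
    )
    data_loss = data_loss.mean()    # assumption: average over the batch
    kldiv_loss = kldiv_loss.mean()
    return data_loss + kldiv_loss, data_loss, kldiv_loss

x_demo = torch.rand(4, 3, 8, 8)
zeros = torch.zeros(4, 2)

# Perfect reconstruction and a posterior equal to N(0, I): both terms vanish
loss_zero, _, _ = pointwise_vae_loss(x_demo, x_demo.clone(), zeros, zeros, x_sigma2=0.9)

# Every pixel off by 1 and mu = 1: data term is 1/0.9, KL term is 2
loss_off, data_off, kl_off = pointwise_vae_loss(
    x_demo, x_demo + 1.0, torch.ones(4, 2), zeros, x_sigma2=0.9
)
print(loss_zero.item(), data_off.item(), kl_off.item())
```

The two checks are convenient sanity tests: the KL term is zero exactly when the posterior matches the standard normal prior, and the data term does not depend on the image size thanks to the $d_x$ scaling.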

TODO: Implement the vae_loss() function in the hw4/autoencoder.py module.

In [13]:
from hw4.autoencoder import vae_loss
torch.manual_seed(42)

def test_vae_loss():
    # Test data
    N, C, H, W = 10, 3, 64, 64 
    z_dim = 32
    x  = torch.randn(N, C, H, W)*2 - 1
    xr = torch.randn(N, C, H, W)*2 - 1
    z_mu = torch.randn(N, z_dim)
    z_log_sigma2 = torch.randn(N, z_dim)
    x_sigma2 = 0.9
    
    loss, _, _ = vae_loss(x, xr, z_mu, z_log_sigma2, x_sigma2)
    
    test.assertAlmostEqual(loss.item(), 58.3234367, delta=1e-3)
    return loss

test_vae_loss()
Out[13]:
tensor(58.3234)

Sampling¶

The main advantage of a VAE is that it can be used as a generative model by sampling the latent space, since we optimize for an isotropic Gaussian prior $p(\bb{Z})$ in the loss function. Let's now implement this so that we can visualize how our model is doing when we train.

TODO: Implement the sample() method in the VAE class within the hw4/autoencoder.py module.
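
Sampling amounts to drawing latent vectors from the prior $\mathcal{N}(\bb{0},\bb{I})$ and passing them through the decoder. The snippet below is a minimal sketch of that idea: `sample_from_prior` and the toy linear decoder are hypothetical stand-ins, not the `sample()` method of the assignment's `VAE` class.

```python
import torch
import torch.nn as nn

def sample_from_prior(decode_fn, n: int, z_dim: int, device: str = "cpu") -> torch.Tensor:
    """Hypothetical helper: decode n latent vectors drawn from N(0, I)."""
    with torch.no_grad():  # no gradients are needed when only generating images
        z = torch.randn(n, z_dim, device=device)
        return decode_fn(z)

# Toy decoder (an assumption): an affine layer reshaped into 3x8x8 "images"
toy_linear = nn.Linear(2, 3 * 8 * 8)

def toy_decode(z: torch.Tensor) -> torch.Tensor:
    return toy_linear(z).view(-1, 3, 8, 8)

generated = sample_from_prior(toy_decode, n=5, z_dim=2)
print(generated.shape)
```

Creating the latent vectors directly on the model's device avoids a device mismatch when the model lives on the GPU.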

In [14]:
samples = vae.sample(5)
_ = plot.tensors_as_images(samples)

Training¶

Time to train!

TODO:

  1. Implement the VAETrainer class in the hw4/training.py module. Make sure to implement the checkpoints feature of the Trainer class if you haven't done so already in Part 1.
  2. Tweak the hyperparameters in the part2_vae_hyperparams() function within the hw4/answers.py module.
In [15]:
import torch.optim as optim
from torch.utils.data import random_split
from torch.utils.data import DataLoader
from torch.nn import DataParallel
from hw4.training import VAETrainer
from hw4.answers import part2_vae_hyperparams

torch.manual_seed(42)

# Hyperparams
hp = part2_vae_hyperparams()
batch_size = hp['batch_size']
h_dim = hp['h_dim']
z_dim = hp['z_dim']
x_sigma2 = hp['x_sigma2']
learn_rate = hp['learn_rate']
betas = hp['betas']

# Data
n_train = int(len(ds_gwb)*0.9)
split_lengths = [n_train, len(ds_gwb) - n_train]  # lengths must sum to len(ds_gwb)
ds_train, ds_test = random_split(ds_gwb, split_lengths)
dl_train = DataLoader(ds_train, batch_size, shuffle=True)
dl_test  = DataLoader(ds_test,  batch_size, shuffle=True)
im_size = ds_train[0][0].shape

# Model
encoder = autoencoder.EncoderCNN(in_channels=im_size[0], out_channels=h_dim)
decoder = autoencoder.DecoderCNN(in_channels=h_dim, out_channels=im_size[0])
vae = autoencoder.VAE(encoder, decoder, im_size, z_dim)
vae_dp = DataParallel(vae).to(device)

# Optimizer
optimizer = optim.Adam(vae.parameters(), lr=learn_rate, betas=betas)

# Loss
def loss_fn(x, xr, z_mu, z_log_sigma2):
    return autoencoder.vae_loss(x, xr, z_mu, z_log_sigma2, x_sigma2)

# Trainer
trainer = VAETrainer(vae_dp, loss_fn, optimizer, device)
checkpoint_file = 'checkpoints/vae'
checkpoint_file_final = f'{checkpoint_file}_final'
if os.path.isfile(f'{checkpoint_file}.pt'):
    os.remove(f'{checkpoint_file}.pt')

# Show model and hypers
print(vae)
print(hp)
VAE(
  (features_encoder): EncoderCNN(
    (cnn): Sequential(
      (0): Conv2d(3, 32, kernel_size=(4, 4), stride=(2, 2))
      (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): LeakyReLU(negative_slope=0.2)
      (3): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
      (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): LeakyReLU(negative_slope=0.2)
      (6): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2))
      (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): LeakyReLU(negative_slope=0.2)
      (9): Conv2d(128, 512, kernel_size=(4, 4), stride=(2, 2))
    )
  )
  (features_decoder): DecoderCNN(
    (cnn): Sequential(
      (0): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (2): ReLU()
      (3): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): ReLU()
      (6): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (7): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (8): ReLU()
      (9): ConvTranspose2d(64, 32, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (10): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (11): ReLU()
      (12): ConvTranspose2d(32, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    )
  )
  (mu_layer): Sequential(
    (0): Linear(in_features=2048, out_features=128, bias=True)
  )
  (log_sigma2_layer): Sequential(
    (0): Linear(in_features=2048, out_features=128, bias=True)
  )
  (decoder_in): Sequential(
    (0): Linear(in_features=128, out_features=2048, bias=True)
  )
)
{'batch_size': 32, 'h_dim': 512, 'z_dim': 128, 'x_sigma2': 0.0001, 'learn_rate': 0.001, 'betas': (0.5, 0.55)}

TODO:

  1. Run the following block to train. It will sample some images from your model every few epochs so you can see the progress.
  2. When you're satisfied with your results, rename the checkpoints file by adding _final. When you run the main.py script to generate your submission, the final checkpoints file will be loaded instead of running training. Note that your final submission zip will not include the checkpoints/ folder. This is OK.

The images you get should be colorful, with different backgrounds and poses.
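
Renaming the checkpoint can be done from a file manager or shell, or with a few lines of Python. The sketch below assumes the checkpoint is stored as `<checkpoint_file>.pt`, as in the cells in this part. The helper name `finalize_checkpoint` is hypothetical, and the demonstration runs in a temporary directory so that it does not touch real checkpoints.

```python
from pathlib import Path
import tempfile

def finalize_checkpoint(checkpoint_file: str) -> Path:
    """Hypothetical helper: rename '<checkpoint_file>.pt' to '<checkpoint_file>_final.pt'."""
    src = Path(f"{checkpoint_file}.pt")
    dst = Path(f"{checkpoint_file}_final.pt")
    if not src.is_file():
        raise FileNotFoundError(f"No checkpoint found at {src}")
    src.rename(dst)
    return dst

# Demonstration with an empty placeholder file in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = Path(tmp) / "checkpoints"
    ckpt_dir.mkdir()
    (ckpt_dir / "vae.pt").write_bytes(b"")
    final_path = finalize_checkpoint(str(ckpt_dir / "vae"))
    final_name = final_path.name
    final_exists = final_path.is_file()
    old_exists = (ckpt_dir / "vae.pt").exists()

print(final_name, final_exists, old_exists)
```

In the notebook this would correspond to calling the helper with `'checkpoints/vae'` once training has produced a model you are satisfied with.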

In [16]:
import IPython.display

def post_epoch_fn(epoch, train_result, test_result, verbose):
    # Plot some samples if this is a verbose epoch
    if verbose:
        samples = vae.sample(n=5)
        fig, _ = plot.tensors_as_images(samples, figsize=(6,2))
        IPython.display.display(fig)
        plt.close(fig)

if os.path.isfile(f'{checkpoint_file_final}.pt'):
    print(f'*** Loading final checkpoint file {checkpoint_file_final} instead of training')
    checkpoint_file = checkpoint_file_final
else:
    res = trainer.fit(dl_train, dl_test,
                      num_epochs=200, early_stopping=20, print_every=10,
                      checkpoints=checkpoint_file,
                      post_epoch_fn=post_epoch_fn)
    
# Plot images from best model
saved_state = torch.load(f'{checkpoint_file}.pt', map_location=device)
vae_dp.load_state_dict(saved_state['model_state'])
print('*** Images Generated from best model:')
fig, _ = plot.tensors_as_images(vae_dp.module.sample(n=15), nrows=3, figsize=(6,6))
--- EPOCH 1/200 ---
*** Saved checkpoint checkpoint
[tqdm progress-bar output for epochs 1 to 200 omitted: 15 training batches and 2 test batches per epoch. The last "*** Saved checkpoint checkpoint" message appears at epoch 51.]
--- EPOCH 200/200 ---
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Input In [16], in <module>
     15     res = trainer.fit(dl_train, dl_test,
     16                       num_epochs=200, early_stopping=20, print_every=10,
     17                       checkpoints=checkpoint_file,
     18                       post_epoch_fn=post_epoch_fn)
     20 # Plot images from best model
---> 21 saved_state = torch.load(f'{checkpoint_file}.pt', map_location=device)
     22 vae_dp.load_state_dict(saved_state['model_state'])
     23 print('*** Images Generated from best model:')

File ~/miniconda3/envs/cs236781-hw4/lib/python3.8/site-packages/torch/serialization.py:594, in load(f, map_location, pickle_module, **pickle_load_args)
    591 if 'encoding' not in pickle_load_args.keys():
    592     pickle_load_args['encoding'] = 'utf-8'
--> 594 with _open_file_like(f, 'rb') as opened_file:
    595     if _is_zipfile(opened_file):
    596         # The zipfile reader is going to advance the current file position.
    597         # If we want to actually tail call to torch.jit.load, we need to
    598         # reset back to the original position.
    599         orig_position = opened_file.tell()

File ~/miniconda3/envs/cs236781-hw4/lib/python3.8/site-packages/torch/serialization.py:230, in _open_file_like(name_or_buffer, mode)
    228 def _open_file_like(name_or_buffer, mode):
    229     if _is_path(name_or_buffer):
--> 230         return _open_file(name_or_buffer, mode)
    231     else:
    232         if 'w' in mode:

File ~/miniconda3/envs/cs236781-hw4/lib/python3.8/site-packages/torch/serialization.py:211, in _open_file.__init__(self, name, mode)
    210 def __init__(self, name, mode):
--> 211     super(_open_file, self).__init__(open(name, mode))

FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/vae.pt'

Questions¶

TODO Answer the following questions. Write your answers in the appropriate variables in the module hw4/answers.py.

In [17]:
from cs236781.answers import display_answer
import hw4.answers as answers

Question 1¶

What does the $\sigma^2$ hyperparameter (x_sigma2 in the code) do? Explain the effect of low and high values.

In [18]:
display_answer(answers.part2_q1)

Your answer:

$\sigma^2$ is the likelihood variance. It scales the data-reconstruction term of the total loss inversely, so it controls the balance between reconstruction and regularization. High values reduce the influence of the reconstruction term and favour the KL regularization term: the latent distribution stays closer to the prior, and the outputs become smoother and less faithful to the individual inputs. Low values have the opposite effect: the reconstruction term dominates, so the outputs stay closer to the inputs while the latent space is regularized less.
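The trade-off can be illustrated with a minimal sketch. It assumes the loss has the form `recon_mse / x_sigma2 + kl_div`; the helper name `weighted_terms` and the toy numbers are hypothetical and are not taken from the hw4 package.

```python
def weighted_terms(recon_mse, kl_div, x_sigma2):
    """Return the two loss contributions, assuming loss = recon_mse / x_sigma2 + kl_div."""
    return recon_mse / x_sigma2, kl_div


# Hypothetical per-sample values, chosen only for illustration.
recon_mse, kl_div = 0.05, 2.0

for x_sigma2 in (0.001, 0.1, 10.0):
    recon_term, kl_term = weighted_terms(recon_mse, kl_div, x_sigma2)
    share = recon_term / (recon_term + kl_term)
    print(f"x_sigma2={x_sigma2}: reconstruction share of the loss = {share:.3f}")
```

With a small `x_sigma2` the reconstruction term accounts for almost all of the loss, and with a large one the KL term takes over.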

Question 2¶

  1. Explain the purpose of both parts of the VAE loss term - reconstruction loss and KL divergence loss.
  2. How is the latent-space distribution affected by the KL loss term?
  3. What's the benefit of this effect?
In [19]:
display_answer(answers.part2_q2)

Your answer:

  1. The reconstruction term pushes the decoder output to match the input, improving reconstruction quality without regard to the shape of the latent space. The KL divergence measures how much two probability distributions differ; minimizing it here means optimizing the parameters of the latent distribution so that it closely resembles the target distribution.
  2. The KL term regularizes the latent space: it keeps the encoded distributions close to the target distribution, which makes the latent space smoother and reduces overfitting to the training data.
  3. If the parameters of the loss are properly tuned, the decoder learns to decode not only single, specific encodings in the latent space but also ones that vary slightly, because during training it is exposed to a range of variations of the encoding of the same input.

Question 3¶

In the formulation of the VAE loss, why do we start by maximizing the evidence distribution, $p(\bb{X})$?

In [20]:
display_answer(answers.part2_q3)

Your answer:

By maximizing the evidence distribution we make the observed data as likely as possible under the model. This gives the model the ability to generate data from the latent space with the same distribution as the data in the instance space.

Question 4¶

In the VAE encoder, why do we model the log of the latent-space variance corresponding to an input, $\bb{\sigma}^2_{\bb{\alpha}}$, instead of directly modelling this variance?

In [21]:
display_answer(answers.part2_q4)

Your answer:

We model the log because the variance itself must always be positive, while its log can take any real value. The encoder output is therefore unconstrained, and exponentiating it always yields a valid, positive variance.
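As a small sketch of this point (the function name `sample_latent` is hypothetical, and plain Python stands in for the tensor operations), any real-valued log-variance maps to a strictly positive standard deviation before sampling:

```python
import math
import random


def sample_latent(mu, log_sigma2, rng):
    """Sample z = mu + sigma * eps, where sigma = exp(0.5 * log_sigma2)."""
    sigma = math.exp(0.5 * log_sigma2)  # strictly positive for any real input
    eps = rng.gauss(0.0, 1.0)           # standard normal noise
    return mu + sigma * eps, sigma


rng = random.Random(0)
for log_sigma2 in (-20.0, 0.0, 5.0):
    z, sigma = sample_latent(0.0, log_sigma2, rng)
    print(f"log_sigma2={log_sigma2:6.1f} -> sigma={sigma:.3e}, z={z:.3f}")
```

Predicting the variance directly would instead require clamping or an extra activation to keep the network output positive.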

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bm}[1]{{\bf #1}} \newcommand{\bb}[1]{\bm{\mathrm{#1}}} $$

Part 3: Generative Adversarial Networks¶

In this part we will implement and train a generative adversarial network and apply it to the task of image generation.

In [2]:
import unittest
import os
import sys
import pathlib
import urllib
import shutil
import re
import zipfile

import numpy as np
import torch
import matplotlib.pyplot as plt

%load_ext autoreload
%autoreload 2

test = unittest.TestCase()
plt.rcParams.update({'font.size': 12})
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
Using device: cpu

Obtaining the dataset¶

We'll use the same data as in Part 2.

But again, you can use a custom dataset, by editing the PART3_CUSTOM_DATA_URL variable in hw4/answers.py.

In [3]:
import cs236781.plot as plot
import cs236781.download
from hw4.answers import PART3_CUSTOM_DATA_URL as CUSTOM_DATA_URL

DATA_DIR = pathlib.Path.home().joinpath('.pytorch-datasets')
if CUSTOM_DATA_URL is None:
    DATA_URL = 'http://vis-www.cs.umass.edu/lfw/lfw-bush.zip'
else:
    DATA_URL = CUSTOM_DATA_URL

_, dataset_dir = cs236781.download.download_data(out_path=DATA_DIR, url=DATA_URL, extract=True, force=False)
File /Users/romy/.pytorch-datasets/lfw-bush.zip exists, skipping download.
Extracting /Users/romy/.pytorch-datasets/lfw-bush.zip...
Extracted 531 to /Users/romy/.pytorch-datasets/lfw/George_W_Bush

Create a Dataset object that will load the extracted images:

In [4]:
import torchvision.transforms as T
from torchvision.datasets import ImageFolder

im_size = 64
tf = T.Compose([
    # Resize to constant spatial dimensions
    T.Resize((im_size, im_size)),
    # PIL.Image -> torch.Tensor
    T.ToTensor(),
    # Dynamic range [0,1] -> [-1, 1]
    T.Normalize(mean=(.5,.5,.5), std=(.5,.5,.5)),
])

ds_gwb = ImageFolder(os.path.dirname(dataset_dir), tf)
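The effect of the `T.Normalize` step can be checked by hand. The sketch below uses hypothetical helper names and scalar pixels instead of tensors; it applies the same per-channel map, `(value - mean) / std`, and its inverse.

```python
def normalize(value, mean=0.5, std=0.5):
    """Affine map applied per channel by T.Normalize: (value - mean) / std."""
    return (value - mean) / std


def denormalize(value, mean=0.5, std=0.5):
    """Inverse map, useful before displaying an image with values in [-1, 1]."""
    return value * std + mean


for pixel in (0.0, 0.25, 1.0):
    print(pixel, "->", normalize(pixel))
```

With mean and std both 0.5, the range [0, 1] is mapped to [-1, 1], which matches the output range of a `Tanh` layer.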

OK, let's see what we got. You can run the following block multiple times to display a random subset of images from the dataset.

In [5]:
_ = plot.dataset_first_n(ds_gwb, 50, figsize=(15,10), nrows=5)
print(f'Found {len(ds_gwb)} images in dataset folder.')
Found 530 images in dataset folder.
In [6]:
x0, y0 = ds_gwb[0]
x0 = x0.unsqueeze(0).to(device)
print(x0.shape)

test.assertSequenceEqual(x0.shape, (1, 3, im_size, im_size))
torch.Size([1, 3, 64, 64])

Generative Adversarial Nets (GANs)¶

GANs, first proposed in a paper by Ian Goodfellow in 2014, are today arguably the most popular type of generative model. GANs are currently producing state-of-the-art results in generative tasks over many different domains.

In a GAN model, two different neural networks compete against each other: A generator and a discriminator.

  • The Generator, which we'll denote as $\Psi _{\bb{\gamma}} : \mathcal{Z} \rightarrow \mathcal{X}$, maps a latent-space variable $\bb{z}\sim\mathcal{N}(\bb{0},\bb{I})$ to an instance-space variable $\bb{x}$ (e.g. an image). Thus a parametric evidence distribution $p_{\bb{\gamma}}(\bb{X})$ is generated, which we typically would like to be as close as possible to the real evidence distribution, $p(\bb{X})$.

  • The Discriminator, $\Delta _{\bb{\delta}} : \mathcal{X} \rightarrow [0,1]$, is a network which, given an instance-space variable $\bb{x}$, returns the probability that $\bb{x}$ is real, i.e. that $\bb{x}$ was sampled from $p(\bb{X})$ and not $p_{\bb{\gamma}}(\bb{X})$.

Training GANs¶

The generator is trained to generate "fake" instances which will maximally fool the discriminator into returning that they're real. Mathematically, the generator's parameters $\bb{\gamma}$ should be chosen so as to maximize the expression $$ \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$

The discriminator is trained to classify between real images, coming from the training set, and fake images generated by the generator. Mathematically, the discriminator's parameters $\bb{\delta}$ should be chosen so as to maximize the expression $$ \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$

These two competing objectives can thus be expressed as the following min-max optimization: $$ \min _{\bb{\gamma}} \max _{\bb{\delta}} \, \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$

A key insight into GANs is that we can interpret the above maximum as the loss with respect to $\bb{\gamma}$:

$$ L({\bb{\gamma}}) = \max _{\bb{\delta}} \, \mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, + \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$

This means that the generator's loss function trains together with the generator itself in an adversarial manner. In contrast, when training our VAE we used a fixed L2 norm as a data loss term.
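To get a feel for the scale of these objectives, the following sketch evaluates them for single discriminator outputs. The function names are hypothetical and scalar probabilities stand in for expectations over batches.

```python
import math


def discriminator_objective(d_real, d_fake):
    """log D(x) + log(1 - D(G(z))) for one real and one fake probability."""
    return math.log(d_real) + math.log(1.0 - d_fake)


def generator_objective(d_fake):
    """log D(G(z)) for one fake probability."""
    return math.log(d_fake)


# A discriminator that cannot tell real from fake outputs 0.5 for both.
print(f"discriminator loss: {-discriminator_objective(0.5, 0.5):.4f}")  # 1.3863
print(f"generator loss:     {-generator_objective(0.5):.4f}")           # 0.6931

# A confident, correct discriminator drives its own loss towards zero.
print(f"discriminator loss: {-discriminator_objective(0.99, 0.01):.4f}")
```

The values 2 ln 2 ≈ 1.386 and ln 2 ≈ 0.693 are useful reference points when reading training logs: losses hovering near them suggest the discriminator is close to guessing.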

Model Implementation¶

We'll now implement a Deep Convolutional GAN (DCGAN) model. See the DCGAN paper for architecture ideas and tips for training.

TODO: Implement the Discriminator class in the hw4/gan.py module. If you wish you can reuse the EncoderCNN class from the VAE model as the first part of the Discriminator.

In [7]:
import hw4.gan as gan

dsc = gan.Discriminator(in_size=x0[0].shape).to(device)
print(dsc)

d0 = dsc(x0)
print(d0.shape)

test.assertSequenceEqual(d0.shape, (1,1))
Discriminator(
  (cnn): Sequential(
    (0): Conv2d(3, 4, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (1): LeakyReLU(negative_slope=0.2, inplace=True)
    (2): Conv2d(4, 8, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (3): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (4): LeakyReLU(negative_slope=0.2, inplace=True)
    (5): Conv2d(8, 16, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (6): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): LeakyReLU(negative_slope=0.2, inplace=True)
    (8): Conv2d(16, 1, kernel_size=(4, 4), stride=(1, 1), bias=False)
  )
  (fc): Sequential(
    (0): Linear(in_features=25, out_features=1, bias=True)
  )
)
torch.Size([1, 1])
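The `in_features=25` of the final linear layer in the printout above follows from the convolution arithmetic. This sketch uses a hypothetical helper and assumes dilation 1; it traces a 64x64 input through the four printed `Conv2d` layers.

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Output spatial size of a Conv2d layer, assuming dilation 1."""
    return (size + 2 * padding - kernel) // stride + 1


size = 64
# (kernel, stride, padding) of the four Conv2d layers in the printout above.
layers = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 0)]
for kernel, stride, padding in layers:
    size = conv2d_out(size, kernel, stride, padding)
    print(size)  # 32, 16, 8, 5

out_channels = 1
print("flattened features:", out_channels * size * size)  # 25
```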

TODO: Implement the Generator class in the hw4/gan.py module. If you wish you can reuse the DecoderCNN class from the VAE model as the last part of the Generator.

In [8]:
z_dim = 128
gen = gan.Generator(z_dim, 4).to(device)
print(gen)

z = torch.randn(1, z_dim).to(device)
xr = gen(z)
print(xr.shape)

test.assertSequenceEqual(x0.shape, xr.shape)
Generator(
  (net): Sequential(
    (0): ConvTranspose2d(1024, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU(inplace=True)
    (6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (8): ReLU(inplace=True)
    (9): ConvTranspose2d(128, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (10): Tanh()
  )
  (projection): Sequential(
    (0): Linear(in_features=128, out_features=16384, bias=True)
  )
)
torch.Size([1, 3, 64, 64])
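The shapes in this printout can be verified the same way. The sketch below uses a hypothetical helper and assumes dilation 1 and no output padding; it checks the projection size and the upsampling path from a 4x4 feature map to a 64x64 image.

```python
def conv_transpose2d_out(size, kernel, stride=1, padding=0):
    """Output size of ConvTranspose2d, assuming dilation 1 and output_padding 0."""
    return (size - 1) * stride - 2 * padding + kernel


featuremap_size = 4
in_channels = 1024
# The projection layer must produce in_channels feature maps of 4x4.
projection_features = in_channels * featuremap_size ** 2

size = featuremap_size
sizes = [size]
for _ in range(4):  # four ConvTranspose2d layers with kernel 4, stride 2, padding 1
    size = conv_transpose2d_out(size, kernel=4, stride=2, padding=1)
    sizes.append(size)

print(projection_features, sizes)  # 16384 [4, 8, 16, 32, 64]
```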

Loss Implementation¶

Let's begin with the discriminator's loss function. Based on the above we can flip the sign and say we want to update the Discriminator's parameters $\bb{\delta}$ so that they minimize the expression $$ -\mathbb{E} _{\bb{x} \sim p(\bb{X}) } \log \Delta _{\bb{\delta}}(\bb{x}) \, - \, \mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (1-\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )). $$

We're using the Discriminator twice in this expression; once to classify data from the real data distribution and once again to classify generated data. Therefore our loss should be computed based on these two terms. Notice that since the discriminator returns a probability, we can formulate the above as two cross-entropy losses.

GANs are notoriously difficult to train. One common trick for improving GAN stability during training is to make the classification labels noisy for the discriminator. This can be seen as a form of regularization, to help prevent the discriminator from overfitting.

We'll incorporate this idea into our loss function. Instead of labels being equal to 0 or 1, we'll make them "fuzzy", i.e. random numbers in the ranges $[0\pm\epsilon]$ and $[1\pm\epsilon]$.
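A minimal plain-Python sketch of this idea follows. It is not the reference solution. It assumes the discriminator returns raw scores (logits) and that each label is drawn uniformly from `[label - label_noise, label + label_noise]`; the names `bce_with_logits` and `noisy_discriminator_loss` are hypothetical.

```python
import math
import random


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy between sigmoid(logit) and target."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def noisy_discriminator_loss(y_data, y_generated, data_label=1,
                             label_noise=0.0, rng=random):
    """Sum of two mean cross-entropy terms, each with uniformly perturbed labels."""
    fake_label = 1 - data_label
    real_losses = [
        bce_with_logits(y, data_label + rng.uniform(-label_noise, label_noise))
        for y in y_data
    ]
    fake_losses = [
        bce_with_logits(y, fake_label + rng.uniform(-label_noise, label_noise))
        for y in y_generated
    ]
    return sum(real_losses) / len(real_losses) + sum(fake_losses) / len(fake_losses)


rng = random.Random(42)
print(noisy_discriminator_loss([2.0, 3.0], [-1.0, 0.5], label_noise=0.3, rng=rng))
```

Because the noise is random, the loss differs between calls unless the generator is seeded, which is why the test cell fixes a seed before calling the loss function.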

TODO: Implement the discriminator_loss_fn() function in the hw4/gan.py module.

In [9]:
from hw4.gan import discriminator_loss_fn
torch.manual_seed(42)

y_data = torch.rand(10) * 10
y_generated = torch.rand(10) * 10

loss = discriminator_loss_fn(y_data, y_generated, data_label=1, label_noise=0.3)
print(loss)

test.assertAlmostEqual(loss.item(), 6.4808731, delta=1e-5)
tensor(6.4809)

Similarly, the generator's parameters $\bb{\gamma}$ should minimize the expression $$ -\mathbb{E} _{\bb{z} \sim p(\bb{Z}) } \log (\Delta _{\bb{\delta}}(\Psi _{\bb{\gamma}} (\bb{z}) )) $$

which can also be seen as a cross-entropy term. This corresponds to "fooling" the discriminator. Notice that the gradient of the loss w.r.t. $\bb{\gamma}$ using this expression also depends on $\bb{\delta}$.
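As an illustrative sketch only (assuming again that the discriminator outputs raw scores, and using the hypothetical names below), the generator's loss is a cross-entropy between the discriminator's scores on generated samples and the label that marks real data:

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy between sigmoid(logit) and target."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def generator_loss(y_generated, data_label=1):
    """Mean cross-entropy of the scores on fakes against the 'real' label."""
    losses = [bce_with_logits(y, data_label) for y in y_generated]
    return sum(losses) / len(losses)


# Scores for generated samples that the discriminator rates as strongly "1".
scores = [4.0, 6.0, 8.0]
print(generator_loss(scores, data_label=1))  # small: fakes look real
print(generator_loss(scores, data_label=0))  # large: fakes look fake
```

The same scores give a small or a large loss depending on which label denotes real data, so the `data_label` convention must agree between the two loss functions.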

TODO: Implement the generator_loss_fn() function in the hw4/gan.py module.

In [10]:
from hw4.gan import generator_loss_fn
torch.manual_seed(42)

y_generated = torch.rand(20) * 10

loss = generator_loss_fn(y_generated, data_label=1)
print(loss)

test.assertAlmostEqual(loss.item(), 0.0222969, delta=1e-3)
tensor(0.0223)

Sampling¶

Sampling from a GAN is straightforward, since it learns to generate data from an isotropic Gaussian latent space distribution.

There is an important nuance, however. Sampling is required during the process of training the GAN, since we generate fake images to show the discriminator. As you'll see in the next section, in some cases we'll need our samples to have gradients (i.e., to be part of the Generator's computation graph).

TODO: Implement the sample() method in the Generator class within the hw4/gan.py module.

In [11]:
samples = gen.sample(5, with_grad=False)
test.assertSequenceEqual(samples.shape, (5, *x0.shape[1:]))
test.assertIsNone(samples.grad_fn)
_ = plot.tensors_as_images(samples.cpu())

samples = gen.sample(5, with_grad=True)
test.assertSequenceEqual(samples.shape, (5, *x0.shape[1:]))
test.assertIsNotNone(samples.grad_fn)

Training¶

Training GANs is a bit different since we need to train two models simultaneously, each with its own separate loss function and optimizer. We'll implement the training logic as a function that handles one batch of data and updates both the discriminator and the generator based on it.

As mentioned above, GANs are considered hard to train. To get some ideas and tips you can see this paper, this list of "GAN hacks" or just do it the hard way :)

TODO:

  1. Implement the train_batch function in the hw4/gan.py module.
  2. Tweak the hyperparameters in the part3_gan_hyperparams() function within the hw4/answers.py module.
In [13]:
import torch.optim as optim
from torch.utils.data import DataLoader
from hw4.answers import part3_gan_hyperparams

torch.manual_seed(42)

# Hyperparams
hp = part3_gan_hyperparams()
batch_size = hp['batch_size']
z_dim = hp['z_dim']

# Data
dl_train = DataLoader(ds_gwb, batch_size, shuffle=True)
im_size = ds_gwb[0][0].shape

# Model
dsc = gan.Discriminator(im_size).to(device)
gen = gan.Generator(z_dim, featuremap_size=4).to(device)

# Optimizer
def create_optimizer(model_params, opt_params):
    opt_params = opt_params.copy()
    optimizer_type = opt_params['type']
    opt_params.pop('type')
    return optim.__dict__[optimizer_type](model_params, **opt_params)
dsc_optimizer = create_optimizer(dsc.parameters(), hp['discriminator_optimizer'])
gen_optimizer = create_optimizer(gen.parameters(), hp['generator_optimizer'])

# Loss
def dsc_loss_fn(y_data, y_generated):
    return gan.discriminator_loss_fn(y_data, y_generated, hp['data_label'], hp['label_noise'])

def gen_loss_fn(y_generated):
    return gan.generator_loss_fn(y_generated, hp['data_label'])

# Training
checkpoint_file = 'checkpoints/gan'
checkpoint_file_final = f'{checkpoint_file}_final'
if os.path.isfile(f'{checkpoint_file}.pt'):
    os.remove(f'{checkpoint_file}.pt')

# Show hypers
print(hp)
{'batch_size': 4, 'z_dim': 512, 'data_label': 0, 'label_noise': 0.4, 'discriminator_optimizer': {'type': 'SGD', 'lr': 0.01}, 'generator_optimizer': {'type': 'SGD', 'lr': 0.01}}

TODO:

  1. Implement the save_checkpoint function in the hw4.gan module. You can decide on your own criterion regarding whether to save a checkpoint at the end of each epoch.
  2. Run the following block to train. It will sample some images from your model every few epochs so you can see the progress.
  3. When you're satisfied with your results, rename the checkpoints file by adding _final. When you run the main.py script to generate your submission, the final checkpoints file will be loaded instead of running training. Note that your final submission zip will not include the checkpoints/ folder. This is OK.
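One simple and debatable checkpointing criterion is sketched below: save only when the latest average generator loss is the lowest seen so far. The function name `should_save_checkpoint` is hypothetical, and the sketch only makes the decision; it does not write any file.

```python
def should_save_checkpoint(dsc_losses, gen_losses):
    """Return True when the most recent generator loss is the lowest so far.

    dsc_losses is accepted for symmetry with the training loop but is unused
    by this particular criterion.
    """
    if not gen_losses:
        return False
    return gen_losses[-1] <= min(gen_losses)


# Hypothetical per-epoch averages, for illustration only.
gen_history = []
for gen_loss in (0.78, 0.75, 0.80):
    gen_history.append(gen_loss)
    print(gen_loss, should_save_checkpoint([], gen_history))
```

A low generator loss does not guarantee good images, so inspecting the sampled images alongside any numeric criterion remains advisable.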
In [17]:
import IPython.display
import tqdm
from hw4.gan import train_batch, save_checkpoint

num_epochs = 100

if os.path.isfile(f'{checkpoint_file_final}.pt'):
    print(f'*** Loading final checkpoint file {checkpoint_file_final} instead of training')
    num_epochs = 0
    gen = torch.load(f'{checkpoint_file_final}.pt', map_location=device,)
    checkpoint_file = checkpoint_file_final

try:
    dsc_avg_losses, gen_avg_losses = [], []
    for epoch_idx in range(num_epochs):
        # We'll accumulate batch losses and show an average once per epoch.
        dsc_losses, gen_losses = [], []
        print(f'--- EPOCH {epoch_idx+1}/{num_epochs} ---')

        with tqdm.tqdm(total=len(dl_train.batch_sampler), file=sys.stdout) as pbar:
            for batch_idx, (x_data, _) in enumerate(dl_train):
                x_data = x_data.to(device)
                dsc_loss, gen_loss = train_batch(
                    dsc, gen,
                    dsc_loss_fn, gen_loss_fn,
                    dsc_optimizer, gen_optimizer,
                    x_data)
                dsc_losses.append(dsc_loss)
                gen_losses.append(gen_loss)
                pbar.update()

        dsc_avg_losses.append(np.mean(dsc_losses))
        gen_avg_losses.append(np.mean(gen_losses))
        print(f'Discriminator loss: {dsc_avg_losses[-1]}')
        print(f'Generator loss:     {gen_avg_losses[-1]}')
        
        if save_checkpoint(gen, dsc_avg_losses, gen_avg_losses, checkpoint_file):
            print(f'Saved checkpoint.')
            

        samples = gen.sample(5, with_grad=False)
        fig, _ = plot.tensors_as_images(samples.cpu(), figsize=(6,2))
        IPython.display.display(fig)
        plt.close(fig)
except KeyboardInterrupt as e:
    print('\n *** Training interrupted by user')
--- EPOCH 1/100 ---
100%|█████████████████████████████████████████| 133/133 [00:28<00:00,  4.67it/s]
Discriminator loss: 1.3612773373610991
Generator loss:     0.7768391115324838
--- EPOCH 2/100 ---
100%|█████████████████████████████████████████| 133/133 [00:29<00:00,  4.57it/s]
Discriminator loss: 1.353277926158188
Generator loss:     0.7538853556589973
--- EPOCH 3/100 ---
100%|█████████████████████████████████████████| 133/133 [00:29<00:00,  4.53it/s]
Discriminator loss: 1.3340200499484414
Generator loss:     0.7980090760646906
--- EPOCH 4/100 ---
100%|█████████████████████████████████████████| 133/133 [00:35<00:00,  3.78it/s]
Discriminator loss: 1.3310418805681674
Generator loss:     0.8058272622581711
--- EPOCH 5/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.40it/s]
Discriminator loss: 1.3612716928460544
Generator loss:     0.7818780236674431
--- EPOCH 6/100 ---
100%|█████████████████████████████████████████| 133/133 [00:28<00:00,  4.62it/s]
Discriminator loss: 1.3134378329255527
Generator loss:     0.7463640691642475
--- EPOCH 7/100 ---
100%|█████████████████████████████████████████| 133/133 [00:28<00:00,  4.69it/s]
Discriminator loss: 1.3780381191045719
Generator loss:     0.7370426462108928
--- EPOCH 8/100 ---
100%|█████████████████████████████████████████| 133/133 [00:34<00:00,  3.85it/s]
Discriminator loss: 1.3114121394946163
Generator loss:     0.8153696891508604
--- EPOCH 9/100 ---
100%|█████████████████████████████████████████| 133/133 [00:33<00:00,  3.92it/s]
Discriminator loss: 1.3627268837806874
Generator loss:     0.776933315105008
--- EPOCH 10/100 ---
100%|█████████████████████████████████████████| 133/133 [00:38<00:00,  3.46it/s]
Discriminator loss: 1.3678826972057945
Generator loss:     0.7519586512020656
--- EPOCH 11/100 ---
100%|█████████████████████████████████████████| 133/133 [00:35<00:00,  3.80it/s]
Discriminator loss: 1.3360447986681658
Generator loss:     0.7902616858482361
--- EPOCH 12/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.21it/s]
Discriminator loss: 1.3116696490380997
Generator loss:     0.7913338891545633
--- EPOCH 13/100 ---
100%|█████████████████████████████████████████| 133/133 [00:32<00:00,  4.06it/s]
Discriminator loss: 1.2946408184847438
Generator loss:     0.8173314055105797
--- EPOCH 14/100 ---
100%|█████████████████████████████████████████| 133/133 [00:28<00:00,  4.64it/s]
Discriminator loss: 1.330717668945628
Generator loss:     0.8217142690393261
--- EPOCH 15/100 ---
100%|█████████████████████████████████████████| 133/133 [00:28<00:00,  4.70it/s]
Discriminator loss: 1.2794516014873534
Generator loss:     0.8819283063250377
--- EPOCH 16/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.26it/s]
Discriminator loss: 1.3264111199773343
Generator loss:     0.8114233695922938
--- EPOCH 17/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.42it/s]
Discriminator loss: 1.3732305074992932
Generator loss:     0.7831786958346689
--- EPOCH 18/100 ---
100%|█████████████████████████████████████████| 133/133 [00:32<00:00,  4.07it/s]
Discriminator loss: 1.3156394420709825
Generator loss:     0.8114110406180074
--- EPOCH 19/100 ---
100%|█████████████████████████████████████████| 133/133 [00:32<00:00,  4.07it/s]
Discriminator loss: 1.3141640612953587
Generator loss:     0.8553277307883241
--- EPOCH 20/100 ---
100%|█████████████████████████████████████████| 133/133 [00:29<00:00,  4.44it/s]
Discriminator loss: 1.3257308136251635
Generator loss:     0.8442430688922566
--- EPOCH 21/100 ---
100%|█████████████████████████████████████████| 133/133 [00:28<00:00,  4.71it/s]
Discriminator loss: 1.3306942053307267
Generator loss:     0.7536655525515851
--- EPOCH 22/100 ---
100%|█████████████████████████████████████████| 133/133 [00:28<00:00,  4.63it/s]
Discriminator loss: 1.2691815999665654
Generator loss:     0.928295197791623
--- EPOCH 23/100 ---
100%|█████████████████████████████████████████| 133/133 [00:27<00:00,  4.76it/s]
Discriminator loss: 1.3326885328256994
Generator loss:     0.7756251746550539
--- EPOCH 24/100 ---
100%|█████████████████████████████████████████| 133/133 [00:28<00:00,  4.65it/s]
Discriminator loss: 1.358824984919756
Generator loss:     0.7776076338793102
--- EPOCH 25/100 ---
100%|█████████████████████████████████████████| 133/133 [23:59<00:00, 10.82s/it]
Discriminator loss: 1.3053617580492693
Generator loss:     0.7909960948434988
--- EPOCH 26/100 ---
100%|█████████████████████████████████████████| 133/133 [00:36<00:00,  3.66it/s]
Discriminator loss: 1.348939643766647
Generator loss:     0.7909385526090636
--- EPOCH 27/100 ---
100%|█████████████████████████████████████████| 133/133 [00:39<00:00,  3.39it/s]
Discriminator loss: 1.3605027579723443
Generator loss:     0.7755582498428517
--- EPOCH 28/100 ---
100%|█████████████████████████████████████████| 133/133 [00:37<00:00,  3.56it/s]
Discriminator loss: 1.3628243584381907
Generator loss:     0.7874012177151845
--- EPOCH 29/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.35it/s]
Discriminator loss: 1.3484721968048496
Generator loss:     0.7750240040004701
--- EPOCH 30/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.18it/s]
Discriminator loss: 1.3944145772690164
Generator loss:     0.7293257641613036
--- EPOCH 31/100 ---
100%|█████████████████████████████████████████| 133/133 [00:32<00:00,  4.10it/s]
Discriminator loss: 1.3444358065612334
Generator loss:     0.7094572049782688
--- EPOCH 32/100 ---
100%|█████████████████████████████████████████| 133/133 [00:33<00:00,  4.00it/s]
Discriminator loss: 1.3868964551983023
Generator loss:     0.7158935560767812
--- EPOCH 33/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.20it/s]
Discriminator loss: 1.371898485305614
Generator loss:     0.7572996813551824
--- EPOCH 34/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.30it/s]
Discriminator loss: 1.3468890181161408
Generator loss:     0.7621907963788599
--- EPOCH 35/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.19it/s]
Discriminator loss: 1.3391794244149573
Generator loss:     0.7658723104268985
--- EPOCH 36/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.42it/s]
Discriminator loss: 1.3757759722551905
Generator loss:     0.7750920158131678
--- EPOCH 37/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.21it/s]
Discriminator loss: 1.353216873075729
Generator loss:     0.7472609581803917
--- EPOCH 38/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.35it/s]
Discriminator loss: 1.386530635948468
Generator loss:     0.6613477049465466
--- EPOCH 39/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.24it/s]
Discriminator loss: 1.379711373408038
Generator loss:     0.6873996737308072
--- EPOCH 40/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.31it/s]
Discriminator loss: 1.3612149953842163
Generator loss:     0.7055589677695941
--- EPOCH 41/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.19it/s]
Discriminator loss: 1.3983838208635946
Generator loss:     0.6888383317710762
--- EPOCH 42/100 ---
100%|█████████████████████████████████████████| 133/133 [00:29<00:00,  4.53it/s]
Discriminator loss: 1.4074606949225403
Generator loss:     0.6762563035004121
--- EPOCH 43/100 ---
100%|█████████████████████████████████████████| 133/133 [00:28<00:00,  4.67it/s]
Discriminator loss: 1.3895688307912726
Generator loss:     0.7187479797162508
--- EPOCH 44/100 ---
100%|█████████████████████████████████████████| 133/133 [00:27<00:00,  4.92it/s]
Discriminator loss: 1.387302753620578
Generator loss:     0.697377728340321
--- EPOCH 45/100 ---
100%|█████████████████████████████████████████| 133/133 [00:28<00:00,  4.71it/s]
Discriminator loss: 1.3809221131461007
Generator loss:     0.7106191462143919
--- EPOCH 46/100 ---
100%|█████████████████████████████████████████| 133/133 [00:27<00:00,  4.90it/s]
Discriminator loss: 1.399391840275069
Generator loss:     0.6971177705248496
--- EPOCH 47/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.20it/s]
Discriminator loss: 1.3900526726156248
Generator loss:     0.7189247890522605
--- EPOCH 48/100 ---
100%|█████████████████████████████████████████| 133/133 [00:27<00:00,  4.84it/s]
Discriminator loss: 1.3859154084571321
Generator loss:     0.7078896944684193
--- EPOCH 49/100 ---
100%|█████████████████████████████████████████| 133/133 [00:29<00:00,  4.50it/s]
Discriminator loss: 1.3643943394037117
Generator loss:     0.7274110178302106
--- EPOCH 50/100 ---
100%|█████████████████████████████████████████| 133/133 [00:27<00:00,  4.82it/s]
Discriminator loss: 1.3860085512462414
Generator loss:     0.6928415022846451
--- EPOCH 51/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.42it/s]
Discriminator loss: 1.377233802824092
Generator loss:     0.6874544288879051
--- EPOCH 52/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.29it/s]
Discriminator loss: 1.3724496741043894
Generator loss:     0.7047534024805054
--- EPOCH 53/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.41it/s]
Discriminator loss: 1.3929707932292967
Generator loss:     0.7077379930288272
--- EPOCH 54/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.39it/s]
Discriminator loss: 1.3775214942774379
Generator loss:     0.7026135545027884
--- EPOCH 55/100 ---
100%|█████████████████████████████████████████| 133/133 [00:32<00:00,  4.10it/s]
Discriminator loss: 1.3894198971583431
Generator loss:     0.7005884270918997
--- EPOCH 56/100 ---
100%|█████████████████████████████████████████| 133/133 [00:29<00:00,  4.44it/s]
Discriminator loss: 1.3808675179804177
Generator loss:     0.7141908115910408
--- EPOCH 57/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.38it/s]
Discriminator loss: 1.3837377415563827
Generator loss:     0.7051679801223869
--- EPOCH 58/100 ---
100%|█████████████████████████████████████████| 133/133 [00:28<00:00,  4.64it/s]
Discriminator loss: 1.3683975701941584
Generator loss:     0.7369146907239928
--- EPOCH 59/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.40it/s]
Discriminator loss: 1.3829459660035326
Generator loss:     0.6538807772155991
--- EPOCH 60/100 ---
100%|█████████████████████████████████████████| 133/133 [00:33<00:00,  3.95it/s]
Discriminator loss: 1.3677203225013905
Generator loss:     0.7293664266292313
--- EPOCH 61/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.18it/s]
Discriminator loss: 1.3800513860874606
Generator loss:     0.7348918784829906
--- EPOCH 62/100 ---
100%|█████████████████████████████████████████| 133/133 [00:28<00:00,  4.73it/s]
Discriminator loss: 1.3744387053009262
Generator loss:     0.7359270601344288
--- EPOCH 63/100 ---
100%|█████████████████████████████████████████| 133/133 [00:33<00:00,  4.01it/s]
Discriminator loss: 1.373998192916239
Generator loss:     0.7276788197065654
--- EPOCH 64/100 ---
100%|█████████████████████████████████████████| 133/133 [00:32<00:00,  4.14it/s]
Discriminator loss: 1.3920388490633857
Generator loss:     0.7200995710559357
--- EPOCH 65/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.36it/s]
Discriminator loss: 1.369957351146784
Generator loss:     0.7154990430165055
--- EPOCH 66/100 ---
100%|█████████████████████████████████████████| 133/133 [00:32<00:00,  4.09it/s]
Discriminator loss: 1.3607412388450222
Generator loss:     0.7023641113051795
--- EPOCH 67/100 ---
100%|█████████████████████████████████████████| 133/133 [00:33<00:00,  4.02it/s]
Discriminator loss: 1.3709515377991182
Generator loss:     0.7412473272560234
--- EPOCH 68/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.28it/s]
Discriminator loss: 1.3542222170005167
Generator loss:     0.7531414652677407
--- EPOCH 69/100 ---
100%|█████████████████████████████████████████| 133/133 [00:33<00:00,  3.98it/s]
Discriminator loss: 1.3692521493237717
Generator loss:     0.7447243252194914
--- EPOCH 70/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.29it/s]
Discriminator loss: 1.3937095358855742
Generator loss:     0.7002077281923222
--- EPOCH 71/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.16it/s]
Discriminator loss: 1.392579540274197
Generator loss:     0.6932471712729088
--- EPOCH 72/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.38it/s]
Discriminator loss: 1.3606340670047845
Generator loss:     0.6910830140113831
--- EPOCH 73/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.32it/s]
Discriminator loss: 1.380547304798786
Generator loss:     0.7060410550662449
--- EPOCH 74/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.41it/s]
Discriminator loss: 1.3912733766369354
Generator loss:     0.6862335025816035
--- EPOCH 75/100 ---
100%|█████████████████████████████████████████| 133/133 [00:32<00:00,  4.09it/s]
Discriminator loss: 1.3533304427799426
Generator loss:     0.7621926060296539
--- EPOCH 76/100 ---
100%|█████████████████████████████████████████| 133/133 [00:29<00:00,  4.46it/s]
Discriminator loss: 1.3760721602834256
Generator loss:     0.71099306004388
--- EPOCH 77/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.25it/s]
Discriminator loss: 1.375013388189158
Generator loss:     0.7724691799708775
--- EPOCH 78/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.25it/s]
Discriminator loss: 1.3783018929617745
Generator loss:     0.7241247108108119
--- EPOCH 79/100 ---
100%|█████████████████████████████████████████| 133/133 [00:33<00:00,  3.93it/s]
Discriminator loss: 1.3538859948179776
Generator loss:     0.7574535751701298
--- EPOCH 80/100 ---
100%|█████████████████████████████████████████| 133/133 [00:33<00:00,  3.97it/s]
Discriminator loss: 1.3742202932673289
Generator loss:     0.7243723779692686
--- EPOCH 81/100 ---
100%|█████████████████████████████████████████| 133/133 [00:42<00:00,  3.10it/s]
Discriminator loss: 1.363932890999586
Generator loss:     0.7276730062370014
--- EPOCH 82/100 ---
100%|█████████████████████████████████████████| 133/133 [00:37<00:00,  3.53it/s]
Discriminator loss: 1.3647771491143936
Generator loss:     0.7565775791505226
--- EPOCH 83/100 ---
100%|█████████████████████████████████████████| 133/133 [00:32<00:00,  4.09it/s]
Discriminator loss: 1.38117810299522
Generator loss:     0.7473030493671733
--- EPOCH 84/100 ---
100%|█████████████████████████████████████████| 133/133 [00:32<00:00,  4.14it/s]
Discriminator loss: 1.3946288716524167
Generator loss:     0.6966705824199476
--- EPOCH 85/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.23it/s]
Discriminator loss: 1.3750159507407282
Generator loss:     0.7198978205372516
--- EPOCH 86/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.27it/s]
Discriminator loss: 1.3447362444454567
Generator loss:     0.74971271054189
--- EPOCH 87/100 ---
100%|█████████████████████████████████████████| 133/133 [00:29<00:00,  4.48it/s]
Discriminator loss: 1.3901844266662025
Generator loss:     0.7170487498878536
--- EPOCH 88/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.30it/s]
Discriminator loss: 1.356699283857991
Generator loss:     0.7160336357310302
--- EPOCH 89/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.26it/s]
Discriminator loss: 1.3760145544109488
Generator loss:     0.7182338896550631
--- EPOCH 90/100 ---
100%|█████████████████████████████████████████| 133/133 [00:33<00:00,  4.03it/s]
Discriminator loss: 1.3508277844665642
Generator loss:     0.7397988003895695
--- EPOCH 91/100 ---
100%|█████████████████████████████████████████| 133/133 [00:29<00:00,  4.49it/s]
Discriminator loss: 1.3720417040631288
Generator loss:     0.7488065503145519
--- EPOCH 92/100 ---
100%|█████████████████████████████████████████| 133/133 [00:29<00:00,  4.44it/s]
Discriminator loss: 1.342560267089901
Generator loss:     0.7304007343779829
--- EPOCH 93/100 ---
100%|█████████████████████████████████████████| 133/133 [00:29<00:00,  4.55it/s]
Discriminator loss: 1.3711320430712592
Generator loss:     0.7673848535781517
--- EPOCH 94/100 ---
100%|█████████████████████████████████████████| 133/133 [00:28<00:00,  4.72it/s]
Discriminator loss: 1.3838464720804888
Generator loss:     0.6711053323924989
--- EPOCH 95/100 ---
100%|█████████████████████████████████████████| 133/133 [00:31<00:00,  4.27it/s]
Discriminator loss: 1.4033516260018026
Generator loss:     0.6842943461317765
--- EPOCH 96/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.42it/s]
Discriminator loss: 1.380405118590907
Generator loss:     0.72748539411932
--- EPOCH 97/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.29it/s]
Discriminator loss: 1.377022463576238
Generator loss:     0.7070425565081432
--- EPOCH 98/100 ---
100%|█████████████████████████████████████████| 133/133 [00:30<00:00,  4.30it/s]
Discriminator loss: 1.3909148614209397
Generator loss:     0.7345977582429585
--- EPOCH 99/100 ---
100%|█████████████████████████████████████████| 133/133 [00:29<00:00,  4.48it/s]
Discriminator loss: 1.3459629752582176
Generator loss:     0.7756604776346594
--- EPOCH 100/100 ---
100%|█████████████████████████████████████████| 133/133 [00:29<00:00,  4.48it/s]
Discriminator loss: 1.3605147465727383
Generator loss:     0.7258370196012626
In [13]:
# Plot images from best or last model
if os.path.isfile(f'{checkpoint_file}.pt'):
    gen = torch.load(f'{checkpoint_file}.pt', map_location=device)
print('*** Images Generated from best model:')
samples = gen.sample(n=15, with_grad=False).cpu()
fig, _ = plot.tensors_as_images(samples, nrows=3, figsize=(6,6))
*** Images Generated from best model:

Questions¶

TODO Answer the following questions. Write your answers in the appropriate variables in the module hw4/answers.py.

In [14]:
from cs236781.answers import display_answer
import hw4.answers as answers

Question 1¶

Explain in detail why during training we sometimes need to maintain gradients when sampling from the GAN, and other times we don't. When are they maintained and why? When are they discarded and why?

In [15]:
display_answer(answers.part3_q1)

Your answer:

Write your answer using markdown and $\LaTeX$:

# A code block
a = 2

An equation: $e^{i\pi} + 1 = 0$

Question 2¶

  1. When training a GAN to generate images, should we decide to stop training solely based on the fact that the Generator loss is below some threshold? Why or why not?

  2. What does it mean if the discriminator loss remains at a constant value while the generator loss decreases?

In [16]:
display_answer(answers.part3_q2)

Your answer:

Write your answer using markdown and $\LaTeX$:

# A code block
a = 2

An equation: $e^{i\pi} + 1 = 0$

Question 3¶

Compare the results you got when generating images with the VAE to the GAN results. What's the main difference and what's causing it?

In [17]:
display_answer(answers.part3_q3)

Your answer:

Write your answer using markdown and $\LaTeX$:

# A code block
a = 2

An equation: $e^{i\pi} + 1 = 0$

$$ \newcommand{\mat}[1]{\boldsymbol {#1}} \newcommand{\mattr}[1]{\boldsymbol {#1}^\top} \newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}} \newcommand{\vec}[1]{\boldsymbol {#1}} \newcommand{\vectr}[1]{\boldsymbol {#1}^\top} \newcommand{\rvar}[1]{\mathrm {#1}} \newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diag}{\mathop{\mathrm {diag}}} \newcommand{\set}[1]{\mathbb {#1}} \newcommand{\cset}[1]{\mathcal{#1}} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\bb}[1]{\boldsymbol{#1}} \newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]} \newcommand{\ip}[3]{\left<#1,#2\right>_{#3}} \newcommand{\given}[]{\,\middle\vert\,} \newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)} \newcommand{\grad}[]{\nabla} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} $$

Part 4: Summary Questions¶

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

Notes

  • Clearly mark where your answer begins, e.g. write "Answer:" in the beginning of your cell.
  • Provide a full explanation, even if the question doesn't explicitly state so. We will reduce points for partial explanations!
  • This notebook should be runnable from start to end without any errors.

CNNs¶

  1. Explain the meaning of the term "receptive field" in the context of CNNs.

Answer:

The receptive field of a feature (an activation in some layer of the CNN) is the region of the input space that can affect the value of that feature.

  1. Explain and elaborate about three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.

Answer:

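  1. Increasing the number of convolutional layers. Each extra layer with kernel size $k$ enlarges the receptive field additively, by $(k-1)$ times the product of the strides of all preceding layers. Every input feature in the field is combined with its own learned weight.

  2. Adding pooling layers (or strided convolutions). Downsampling by a factor $s$ multiplies the receptive-field growth of every subsequent layer by $s$, so the receptive field grows multiplicatively. Pooling also reduces the dimensions of the feature maps, and with them the amount of computation performed in the rest of the network. A pooling layer summarises the features present in a region of the feature map with a fixed function (e.g. max or average) rather than with learned weights.

  3. Dilated convolutions. They introduce spacing between the values of a convolutional kernel, so a kernel of size $k$ with dilation $d$ spans $d(k-1)+1$ inputs while the number of weights in the kernel is unchanged. The inputs are sampled sparsely within that span. If the dilation is doubled from layer to layer, the receptive field grows exponentially with depth.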

  1. Imagine a CNN with three convolutional layers, defined as follows:
In [1]:
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)

cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape
Out[1]:
torch.Size([1, 32, 122, 122])

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?

Answer:

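Using the formulas:

$$ r_l = r_{l-1} + (k_l - 1) \cdot \prod_{i=1}^{l-1}s_i, \qquad r_0 = 1 $$

where $k_l$ and $s_l$ are the kernel size and stride of layer $l$. A kernel of size $k$ with dilation $d$ has the effective size:

$$ k_{\text{eff}} = d \cdot (k - 1) + 1 $$

We will get:

Layer 1 (Conv2D): $$ k = 3, s = 1 $$ $$ R_1 = 1 + (3 - 1) \cdot 1 = 3 $$

Layer 2 (Pooling): $$ k = 2, s = 2 $$ $$ R_2 = R_1 + (2 - 1) \cdot 1 = 4 $$

Layer 3 (Conv2D): $$ k = 5, s = 2 $$ $$ R_3 = R_2 + (5 - 1) \cdot 2 = 12 $$

Layer 4 (Pooling): $$ k = 2, s = 2 $$ $$ R_4 = R_3 + (2 - 1) \cdot 2^2 = 16 $$

Layer 5 (Conv2D): $$ k_{\text{eff}} = 2 \cdot (7 - 1) + 1 = 13, s = 1 $$ $$ R_5 = R_4 + (13 - 1) \cdot 2^3 = 112 $$

The size of the receptive field of each "pixel" in the output tensor is [112 x 112]. The ReLU layers are pointwise and do not change it.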
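
The layer-by-layer recursion used in this answer can be cross-checked with a few lines of plain Python. This is a minimal sketch: the helper name `receptive_field` and the `(kernel_size, stride, dilation)` tuple encoding are illustrative assumptions, not part of the assignment code. Padding is ignored because it does not change the size of the receptive field, and the ReLU layers are skipped because they are pointwise.

```python
def receptive_field(layers):
    """Receptive field size for a stack of conv/pool layers.

    layers: list of (kernel_size, stride, dilation) tuples, ordered from
    input to output (hypothetical encoding chosen for this sketch).
    """
    r = 1     # receptive field of a single input pixel
    jump = 1  # product of the strides of all layers seen so far
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1    # effective kernel size with dilation
        r += (k_eff - 1) * jump    # growth is scaled by the earlier strides
        jump *= s
    return r


# The network from the question, ReLU layers omitted.
layers = [
    (3, 1, 1),  # Conv2d(kernel_size=3)
    (2, 2, 1),  # MaxPool2d(2)
    (5, 2, 1),  # Conv2d(kernel_size=5, stride=2)
    (2, 2, 1),  # MaxPool2d(2)
    (7, 1, 2),  # Conv2d(kernel_size=7, dilation=2)
]
print("receptive field:", receptive_field(layers))
```

Keeping a running product of the strides avoids recomputing it for every layer and makes the additive and multiplicative effects easy to tell apart.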

  1. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

    After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead, $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

    However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.

Answer:

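In residual networks, skip connections add the input of a layer to its output, which eases the flow of gradients, mitigates the vanishing-gradient problem and allows deeper networks to be trained. With the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, the convolution $f_l$ no longer has to produce the full output of the layer, only the residual $\vec{y}_l-\vec{x}$, i.e. the correction relative to the identity. The filters $\vec{\theta}_l$ are therefore optimized for a different target function, along a different optimization path, and this results in different filters.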

Dropout¶

  1. True or false: dropout must be placed only after the activation function.

Answer:

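False. Dropout sets a random fraction of the units to zero, and this can be done either before or after the activation function. For ReLU the two orders even give the same result, since $\mathrm{ReLU}(0)=0$ and the positive scaling factor commutes with ReLU.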

  1. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.

Answer:

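When we apply dropout, only a fraction of the neurons is active during training, while at test time all of them are active, which is why scaling is needed to compensate. Let $a$ be an activation and $m \sim \mathrm{Bernoulli}(1-p)$ its dropout mask. Without scaling, $\mathbb{E}[m \cdot a] = (1-p)\,a \neq a$. With scaling, $$ \mathbb{E}\left[\frac{m}{1-p}\, a\right] = \frac{(1-p)\cdot 1 + p \cdot 0}{1-p}\, a = a, $$ so the factor $1/(1-p)$ is exactly what keeps each activation unchanged in expectation.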
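
The effect of the $1/(1-p)$ scaling can also be checked empirically. The following is a small Monte Carlo sketch in plain Python. The function name `dropout`, the constant activation value and the sample count are assumptions chosen for illustration, and this is not how the course code or PyTorch implements dropout internally.

```python
import random


def dropout(activations, p, rng):
    """Inverted dropout: zero each value with probability p,
    scale the surviving values by 1 / (1 - p)."""
    scale = 1.0 / (1.0 - p)
    return [a * scale if rng.random() >= p else 0.0 for a in activations]


rng = random.Random(0)        # fixed seed so the run is repeatable
p, a, n = 0.3, 2.0, 100_000   # drop probability, activation value, samples

dropped = dropout([a] * n, p, rng)
mean_scaled = sum(dropped) / n            # average with the 1/(1-p) scaling
mean_unscaled = mean_scaled * (1.0 - p)   # same masks, scaling undone

print(f"original activation:   {a}")
print(f"mean with scaling:     {mean_scaled:.3f}")
print(f"mean without scaling:  {mean_unscaled:.3f}")
```

With scaling, the empirical mean stays close to the original activation. Without it, the mean shrinks by roughly a factor of $1-p$.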

Losses and Activation functions¶

  1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? If so, why? If not, demonstrate with a numerical example. What would you use instead?

Answer:

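No. L2 loss measures the squared error between the prediction and the label, and it is the default loss function for regression tasks. The default loss function for a classification task is binary cross-entropy (BCE), which maximizes the likelihood of the correct class and penalizes confident mistakes much more heavily. For example, for a true label $y=1$ and a predicted probability $p=0.01$, the squared error is $(1-0.01)^2 \approx 0.98$ and can never exceed 1, while the BCE is $-\log(0.01) \approx 4.6$ and grows without bound as $p \to 0$.

$L_2 \text{ loss} = \sum_{i=1}^{N} (y_i-y_i^{\text{pred}})^2$

$\text{BCE loss} = \frac{1}{N} \sum_{i=1}^{N} -\left(y_i\cdot \log(p_i) + (1-y_i)\cdot \log(1-p_i)\right)$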
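
A second way to see the difference between the two losses is through their gradients. The sketch below assumes the classifier ends with a sigmoid, so that $p=\sigma(z)$ for a logit $z$. This assumption, and the helper names, are illustrative and not taken from the assignment. Under it, the gradients with respect to the logit are $2(p-y)\,p\,(1-p)$ for the squared error and $p-y$ for BCE.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def l2_grad_wrt_logit(z, y):
    """d/dz of (sigmoid(z) - y)^2."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)


def bce_grad_wrt_logit(z, y):
    """d/dz of -[y log p + (1 - y) log(1 - p)] with p = sigmoid(z)."""
    return sigmoid(z) - y


y = 1.0  # true label: hotdog
for z in (-6.0, -2.0, 0.0, 2.0):
    print(
        f"z={z:+.1f}  p={sigmoid(z):.4f}  "
        f"dL2/dz={l2_grad_wrt_logit(z, y):+.5f}  "
        f"dBCE/dz={bce_grad_wrt_logit(z, y):+.5f}"
    )
```

For a confidently wrong prediction (large negative $z$ with $y=1$) the squared-error gradient is nearly zero because the sigmoid is saturated, so the model learns slowly exactly where it is most wrong. The BCE gradient stays close to $-1$ in the same situation.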

  1. After months of research into the origins of climate change, you observe the following result:

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in N locations around the globe. You define your model as follows:

In [ ]:
import torch.nn as nn

N = 42  # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H),
        nn.Sigmoid(),
    ]*N,
    nn.Linear(in_features=H, out_features=1),
)

While training your model you notice that the loss reaches a plateau after only a few iterations. It seems that your model is no longer training. What is the most likely cause?

Answer:

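The model stacks 44 linear layers (1 + 42 + 1) with a sigmoid after every layer except the last, and it has no skip connections. The derivative of a sigmoid is at most 0.25 and is close to zero when the unit saturates, so the gradient that reaches the early layers is a product of dozens of small factors. The model is most likely no longer training due to vanishing gradients.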

  1. Referring to question 2 above: A friend suggests that if you replace the sigmoid activations with tanh, it will solve your problem. Is he correct? Explain why or why not.

Answer:

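Only partly. The range of a sigmoid is $(0, 1)$ and its derivative is at most 0.25, while the range of a tanh is $(-1, 1)$ and its derivative is at most 1 (at zero), so gradients shrink more slowly with tanh. However, tanh still saturates for large $|x|$, where its derivative approaches zero, and the model is so deep that this improvement is unlikely to solve the problem of vanishing gradients.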
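
To get a feel for the numbers, the sketch below multiplies the activation derivative across the 43 activation layers of the model. It is a deliberately simplified illustration: it assumes every unit sees the same pre-activation value `x` and ignores the weight matrices, which also enter the chain-rule product. The helper names are made up for this example.

```python
import math


def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2


depth = 43  # number of activation layers between input and output
for x in (0.0, 1.0, 2.0):
    print(
        f"x={x:.1f}  "
        f"sigmoid'(x)^{depth} = {sigmoid_grad(x) ** depth:.3e}  "
        f"tanh'(x)^{depth} = {tanh_grad(x) ** depth:.3e}"
    )
```

Even in the best case for the sigmoid ($x=0$, derivative 0.25) the product is vanishingly small. The tanh product is only preserved when every pre-activation is exactly zero, and it collapses quickly once the units move away from the origin.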

  1. Regarding the ReLU activation, state whether the following sentences are true or false and explain: A. In a model using exclusively ReLU activations, there can be no vanishing gradients.

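Answer: False. ReLU does not saturate for positive inputs, which reduces the problem, but its gradient is exactly zero for negative inputs (zero-nodes), and the product of weight matrices along the chain can still shrink the gradient.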

B. The gradient of ReLU is linear with its input when the input is positive.



Answer: False. For positive inputs the gradient is constant and equal to 1. It is the output, not the gradient, that is linear in the input.

C. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.



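Answer: True. For negative inputs both the output and the gradient are zero, so a neuron whose pre-activation is negative for every input stops updating and stays at zero.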

Optimization¶

  1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).

Answer: In GD the whole dataset is used to calculate the average loss and its gradient for every update of the weights.

In mini-batch SGD the loss calculation and the weight update are made on a small random subset (batch) of the dataset each time. It takes one epoch, consisting of many updates, to go over the whole training set.

Stochastic gradient descent (SGD) calculates the gradient for a single randomly chosen sample and updates the weights after each one.
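
The three variants differ only in how many samples contribute to each update, which the following sketch makes explicit through a single `batch_size` parameter. It is a toy example under stated assumptions: a one-parameter linear model $y = wx$ fitted with a squared-error loss on noise-free synthetic data. The function name `train` and all hyperparameters are illustrative choices.

```python
import random


def train(xs, ys, batch_size, lr=0.05, epochs=50, seed=0):
    """Fit y = w * x by gradient descent on the squared error.

    batch_size == len(xs)   -> regular GD (one update per epoch)
    1 < batch_size < len(xs) -> mini-batch SGD
    batch_size == 1         -> SGD (one update per sample)
    """
    rng = random.Random(seed)
    w, updates = 0.0, 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of (w * x - y)^2 over the batch.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
            updates += 1
    return w, updates


xs = [i / 10 for i in range(1, 21)]  # 20 samples: 0.1, 0.2, ..., 2.0
ys = [3.0 * x for x in xs]           # ground truth: w = 3

w_gd, n_gd = train(xs, ys, batch_size=len(xs))
w_mb, n_mb = train(xs, ys, batch_size=5)
w_sgd, n_sgd = train(xs, ys, batch_size=1)

print(f"GD:         w={w_gd:.4f} after {n_gd} updates")
print(f"mini-batch: w={w_mb:.4f} after {n_mb} updates")
print(f"SGD:        w={w_sgd:.4f} after {n_sgd} updates")
```

For the same number of epochs, smaller batches perform many more (cheaper and noisier) updates. On real data the per-sample gradients disagree with each other, so the SGD trajectory would be noisier than in this noise-free toy problem.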

  1. Regarding SGD and GD:
    1. Provide at least two reasons for why SGD is used more often in practice compared to GD.
    2. In what cases can GD not be used at all?

Answer:

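A. 1 - Memory is limited: a single sample or a small batch fits in memory, while computing the gradient over the whole dataset at once may not. 2 - GD can get stuck at a local minimum, while the noise in the SGD gradient estimates can help it escape. SGD updates are also much cheaper, so many more of them are made per pass over the data.

B. When the dataset is too big to be processed in a single update, or when the data is not fully available in advance (e.g. it arrives as a stream).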

  1. You have trained a deep resnet to obtain SoTA results on ImageNet. While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average. Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM. You're now considering increasing the mini-batch size from $B$ to $2B$. Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? Explain in detail.

Answer:

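It's difficult to know for certain. On the one hand, we expect the number of iterations to decrease, because each gradient is averaged over twice as many samples and is therefore a less noisy estimate of the true gradient. On the other hand, the reduction is typically smaller than a factor of two, and a batch size that is too large can lead to poor generalization.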

  1. For each of the following statements, state whether they're true or false and explain why.
    1. When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset.
    2. Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD.
    3. SGD is less likely to get stuck in local minima, compared to GD.
    4. Training with SGD requires more memory than with GD.
    5. Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum.
    6. Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum.

Answer:

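A. True. In pure SGD every update considers a single sample, so each epoch performs one optimization step per sample.

B. False. Since each gradient is calculated from one sample at a time, SGD gradients are more affected by noise and have higher variance than GD gradients.

C. True. Every sample defines a slightly different loss surface, so it is unlikely that all of them share the same local minimum, and the noise in the updates can push the weights out of a minimum in which GD would remain.

D. False. In SGD we need memory to store only one sample at a time, whereas GD processes the whole dataset for each update.

E. False. We can't guarantee that. On a non-convex loss neither method is guaranteed to reach the global minimum: GD follows the exact gradient and stops at a stationary point, which may be a local minimum or a saddle point.

F. False. Momentum damps the oscillations of SGD in a narrow ravine, but Newton's method uses the second derivative (curvature) to rescale the step in each direction, which improves convergence more effectively.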
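
The comparison in statement F can be illustrated on a toy ravine. The sketch below is a simplified setting with several assumptions: a quadratic loss $f(x,y)=\tfrac{1}{2}(x^2+100y^2)$ whose Hessian is diagonal, exact (noise-free) gradients rather than stochastic ones, and hand-picked values for the learning rate and momentum. All function names are illustrative.

```python
import math

CURVATURE = (1.0, 100.0)  # diagonal of the Hessian: a narrow ravine along x


def grad(p):
    return (CURVATURE[0] * p[0], CURVATURE[1] * p[1])


def norm(p):
    return math.hypot(p[0], p[1])


def momentum_steps(p, lr=0.01, beta=0.9, tol=1e-6, max_iters=10_000):
    """Gradient descent with (heavy-ball) momentum; returns steps to converge."""
    v = (0.0, 0.0)
    for step in range(1, max_iters + 1):
        g = grad(p)
        v = (beta * v[0] - lr * g[0], beta * v[1] - lr * g[1])
        p = (p[0] + v[0], p[1] + v[1])
        if norm(p) < tol:
            return step
    return max_iters


def newton_steps(p, tol=1e-6, max_iters=100):
    """Newton's method: the gradient is rescaled by the inverse Hessian."""
    for step in range(1, max_iters + 1):
        g = grad(p)
        p = (p[0] - g[0] / CURVATURE[0], p[1] - g[1] / CURVATURE[1])
        if norm(p) < tol:
            return step
    return max_iters


start = (1.0, 1.0)
n_momentum = momentum_steps(start)
n_newton = newton_steps(start)
print(f"momentum: {n_momentum} steps, Newton: {n_newton} step(s)")
```

On an exactly quadratic loss Newton's method jumps to the minimum in a single step, which is the extreme case of the advantage described above. On real networks the Hessian is expensive to compute and invert, which is why first-order methods with momentum are still the practical choice.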

  1. In tutorial 5 we saw an example of bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network.
    1. True or false: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc). Provide a mathematical justification for your answer.

Answer:

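False. In the tutorial we saw that there are cases where the minimum can be found without a descent method, by an analytical (closed-form) solution. More generally, the outer problem only needs the solution of the inner problem and its derivative with respect to the layer's inputs and parameters, and that derivative can be obtained by differentiating the optimality condition at the solution, regardless of which solver produced it.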

  1. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$. Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
    1. Explain the concepts of "vanishing gradients", and "exploding gradients".
    2. How can each of these problems be caused by increased depth?
    3. Provide a numerical example demonstrating each.
    4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?

Answer:

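A. Vanishing gradients happen when increasingly small gradients are backpropagated through the network, so the updates to the early layers become negligible. Activation functions with a plateau, like sigmoid and tanh, may lead to this problem: when the chain rule multiplies many small values, the gradient approaches zero. Exploding gradients are the opposite case: the derivatives become very large as they are backpropagated, and the model becomes unstable.

B. Due to the chain rule: the gradient with respect to an early layer is a product of one factor per subsequent layer. With more layers there are more factors, so the product shrinks (factors smaller than 1) or grows (factors larger than 1) exponentially with depth.

C. If we assume a 3-layer network that applies the same function $f$ in every layer, by the chain rule, for the first layer:

$\frac{d\,f(f(f(x)))}{dx} = \frac{d\,f(f(f(x)))}{d\,f(f(x))} \cdot \frac{d\,f(f(x))}{d\,f(x)} \cdot \frac{d\,f(x)}{dx}$

If each factor is smaller than 1 the gradients vanish, and if each factor is larger than 1 they explode. For example, with 50 layers, factors of $0.5$ give $0.5^{50} \approx 9 \cdot 10^{-16}$, while factors of $1.5$ give $1.5^{50} \approx 6 \cdot 10^{8}$.

D. The loss will reach a plateau if the gradients are vanishing, and it will oscillate or diverge (possibly to NaN) if the gradients are exploding.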

Backpropagation¶

  1. You wish to train the following 2-layer MLP for a binary classification task: $$ \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2 $$ You wish to minimize the in-sample loss function, defined as $$ L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right) $$ where the pointwise loss is binary cross-entropy: $$ \ell(y, \hat{y}) = - y \log(\hat{y}) - (1-y) \log(1-\hat{y}) $$

    Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\vec{b}_1$, $\vec{b}_2$, $\vec{x}$.

  1. Given the following code snippet, implement the custom backward function part4_affine_backward in hw4/answers.py so that it passes the asserts.
In [ ]:
from torch.autograd import Function

from hw4.answers import part4_affine_backward

N, d_in, d_out = 100, 11, 7
dtype = torch.float64
X = torch.rand(N, d_in, dtype=dtype)
W = torch.rand(d_out, d_in, requires_grad=True, dtype=dtype)
b = torch.rand(d_out, requires_grad=True, dtype=dtype)

def affine(X, W, b):
    return 0.5 * X @ W.T + b

class AffineLayerFunction(Function):
    @staticmethod
    def forward(ctx, X, W, b):
        result = affine(X, W, b)
        ctx.save_for_backward(X, W, b)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        return part4_affine_backward(ctx, grad_output)

l1 = torch.sum(AffineLayerFunction.apply(X, W, b))
l1.backward()
W_grad1 = W.grad
b_grad1 = b.grad

l2 = torch.sum(affine(X, W, b))
W.grad = b.grad = None
l2.backward()
W_grad2 = W.grad
b_grad2 = b.grad

assert torch.allclose(W_grad1, W_grad2)
assert torch.allclose(b_grad1, b_grad2)

Sequence models¶

  1. Regarding word embeddings:
    1. Explain this term and why it's used in the context of a language model.
    2. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? if no, why not?

Answer:

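A. A word embedding is a way to represent words (tokens) as dense vectors, such that words that are close in the vector space are expected to have similar meaning. It is used in a language model because the model needs a numerical, fixed-size representation of words on which it can perform calculations.

B. Yes, in principle, but the tokens still need some numerical representation in order to make calculations, e.g. one-hot vectors of the vocabulary size. The consequence is a very high-dimensional, sparse input that carries no notion of similarity between words, so the model would be larger and would generalize worse.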

  1. Considering the following snippet, explain:
    1. What does Y contain? why this output shape?
    2. How you would implement nn.Embedding yourself using only torch tensors.
In [ ]:
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")

Answer:

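A. Y contains, for every token index in X, the corresponding 42000-dimensional embedding vector, i.e. a row of the embedding's $42 \times 42000$ weight matrix. Its shape is therefore the shape of X with an extra trailing dimension of 42000: (5, 6, 7, 8, 42000).

B. We can write: embedding = torch.randn(num_embeddings, embedding_dim, requires_grad=True) and then look the rows up by integer indexing: Y = embedding[X]. Note that torch.gather is not suitable here, since it requires the index tensor to have the same number of dimensions as the input.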

  1. Regarding truncated backpropagation through time (TBPTT) with a sequence length of $S$: State whether the following sentences are true or false, and explain.
    1. TBPTT uses a modified version of the backpropagation algorithm.
    2. To implement TBPTT we only need to limit the length of the sequence provided to the model to length $S$.
    3. TBPTT allows the model to learn relations between input that are at most $S$ timesteps apart.

Answer:

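A. True. Here the losses are accumulated over a chunk of the sequence, and the update is then made using the accumulated gradients, which are propagated back at most $S$ timesteps rather than through the whole sequence.

B. False. The input remains the same and is processed in consecutive chunks. What changes is the sequence used for backpropagation: the hidden state must be carried over from one chunk to the next and detached from the computation graph, so that the gradients stop at the chunk boundary.

C. False. We can learn relations between inputs that are more than $S$ timesteps apart: we keep the hidden state of the previous sequence, which is then used in the next sequence, so the output depends on all previous timesteps.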

Attention¶

  1. In tutorial 7 (part 2) we learned how to use attention to perform alignment between a source and target sequence in machine translation.

    1. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
    1. After learning that self-attention is gaining popularity thanks to the shiny new transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections). What influence do you expect this will have on the learned hidden states?

Unsupervised learning¶

  1. As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term. What would be the qualitative effect of this on:

    1. Images reconstructed by the model during training ($x\to z \to x'$)?
    2. Images generated by the model ($z \to x'$)?
  1. Regarding VAEs, state whether each of the following statements is true or false, and explain:
    1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
    2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.
    3. Since the real VAE loss term is intractable, what we actually minimize instead is its upper bound, in the hope that the bound is tight.
  1. Regarding GANs, state whether each of the following statements is true or false, and explain:
    1. Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.
    2. It's crucial to backpropagate into the generator when training the discriminator.
    3. To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.
    4. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that its output isn't arbitrary.
    5. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.

Graph Neural Networks¶

  1. You have implemented a graph convolutional layer based on the following formula, for a graph with $N$ nodes: $$ \mat{Y}=\varphi\left( \sum_{k=1}^{q} \mat{\Delta}^k \mat{X} \mat{\alpha}_k + \vec{b} \right). $$
    1. Assuming $\mat{X}$ is the input feature matrix of shape $(N, M)$: what does $\mat{Y}$ contain in its rows?
    2. Unfortunately, due to a bug in your calculation of the Laplacian matrix, you accidentally zeroed the row and column $i=j=5$ (assume more than 5 nodes in the graph). What would be the effect of this bug on the output of your layer, $\mat{Y}$?
  1. We have discussed the notion of a Receptive Field in the context of a CNN. How would you define a similar concept in the context of a GCN (i.e. a model comprised of multiple graph convolutional layers)?